ABSTRACT


As computers have become a pivotal component of daily life, computer safety, reliability, and
security issues have become enormously important. A considerable amount of recent research in
program analysis and software engineering has been carried out on techniques and tools for finding
software bugs and security vulnerabilities, and on checking computer-safety properties. Most of
this research has focused on analyzing source code. Recently, machine-code analysis has begun
to receive considerable attention, both because source code is often unavailable and because the
machine code generated from source code may not match the source code in various ways.

The tools and techniques for analyzing machine code are, in principle, language-independent.
However, their implementations are often tied to one specific instruction set. Retargeting them to
another instruction set can be an expensive and error-prone process. This dissertation describes
a system that I developed, called TSL (for “Transformer Specification Language”), which provides
a systematic solution to the problem of creating retargetable tools for analyzing machine code.
The TSL system is a meta-tool, or tool generator, that automatically creates different abstract
interpreters for machine-code instruction sets. The system addresses the problem of supporting
multiple instruction sets by providing a YACC-like mechanism for creating key components of
machine-code analyzers. The TSL system takes a single, unified description of the concrete
operational semantics of an instruction set, which is specified in TSL, a strongly typed, first-order
functional language, and automatically creates implementations of different abstract interpreters
for the given instruction set.

TSL provides a fixed set of base-types and operators, as well as map-types with map-access and
(applicative) map-update operations. The TSL compiler generates a common intermediate
representation that allows the meanings of the input-language constructs to be redefined by supplying
alternative interpretations of the base-types, map-types, and the operations on them (“semantic
reinterpretation”). Because all the abstract operations are defined at the meta-level, a semantic
reinterpretation is independent of any given instruction set defined in TSL. Therefore, each
implementation of an analysis component’s driver serves as the unchanging driver for use in different
instantiations of the analysis component for different instruction sets. The TSL language becomes
the specification language for retargeting that analysis component to different instruction sets.

As an application of the TSL system, we developed a novel way of applying semantic
reinterpretation to automatically create symbolic-analysis primitives for symbolic evaluation,
weakest-liberal precondition, and symbolic composition. Furthermore, using the TSL system, as well as
the TSL-generated symbolic-analysis primitives, we developed a machine-code verification tool,
called MCVETO, and a concolic-execution-based program-exploration tool, called BCE.

•  MCVETO  addresses  a  large  number  of  issues  that  arise  when  developing  model-checking 
tools  for  machine  code,  for  which  standard  techniques  used  in  source-code  model-checking 
tools  would  be  unsound  if  applied  to  machine  code. 

•  What  distinguishes  the  work  on  BCE  is  that  it  makes  use  of  control-dependence  information 
to  make  program  exploration  goal-directed  toward  a  given  set  of  targets. 
Chapter 1
Introduction 

As computers have become a pivotal component of daily life, computer safety, reliability, and
security issues have become enormously important. To address these issues, a considerable amount
of research has been carried out recently in the programming-language and software-engineering
communities on techniques for finding software bugs and security vulnerabilities, and on checking
computer-safety properties. This work has led to a large number of program-analysis techniques
and tools. Essentially all of the results described in the literature are, in principle,
language-independent; however, their implementations are often tied to one specific language. Retargeting
them to another language (as well as implementing a new analysis for the same language) can be
an expensive and error-prone process. For machine-code analyses, having a language-dependent
implementation is even worse than for source-code analyses because instruction sets usually
contain several hundred kinds of instructions, and a given instruction set often has special features not
found in other instruction sets.

This dissertation describes a system that I developed, called TSL (for “Transformer
Specification Language”), which helps in the creation of tools for analyzing machine code. The
TSL system is a meta-tool, or tool generator, that automatically creates different abstract
interpreters for machine-code instruction sets. The system addresses the problem of supporting multiple
instruction sets by providing a YACC-like mechanism for creating key components of
machine-code analyzers. The TSL system takes a single, unified description of the concrete operational
semantics of an instruction set, and automatically creates implementations of different abstract
interpreters for the given instruction set.
An instruction set’s concrete semantics is specified in TSL’s input language, which is a strongly
typed, first-order functional language with a datatype-definition mechanism for defining recursive
datatypes, plus deconstruction by means of pattern matching. Writing a TSL specification for an
instruction set is similar to writing an interpreter in first-order ML.
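
To give a feel for what "writing an interpreter" means here, the following is a minimal sketch in Python rather than in TSL itself. Everything in it is an assumption made for illustration: the two-instruction toy instruction set (`Mov`, `Add`), the operand types, and the representation of a state as a register map are hypothetical and are not taken from any TSL specification. The sketch only mirrors the ingredients named above: recursive datatypes for abstract syntax, deconstruction by case analysis, and applicative (non-destructive) map update.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

MASK32 = 0xFFFFFFFF  # assumption: registers hold 32-bit values


# Hypothetical abstract syntax of a toy instruction set.
@dataclass(frozen=True)
class Reg:
    name: str


@dataclass(frozen=True)
class Imm:
    value: int


Operand = Union[Reg, Imm]


@dataclass(frozen=True)
class Mov:
    dst: Reg
    src: Operand


@dataclass(frozen=True)
class Add:
    dst: Reg
    src: Operand


Instruction = Union[Mov, Add]
State = Dict[str, int]  # register name -> value


def interp_operand(op: Operand, state: State) -> int:
    """Evaluate an operand by case analysis on its constructor."""
    if isinstance(op, Reg):
        return state.get(op.name, 0)
    if isinstance(op, Imm):
        return op.value & MASK32
    raise TypeError(f"unknown operand: {op!r}")


def interp_instr(instr: Instruction, state: State) -> State:
    """Return the post-state; the input state is left unmodified."""
    if isinstance(instr, Mov):
        return {**state, instr.dst.name: interp_operand(instr.src, state)}
    if isinstance(instr, Add):
        old = state.get(instr.dst.name, 0)
        new = (old + interp_operand(instr.src, state)) & MASK32
        return {**state, instr.dst.name: new}
    raise TypeError(f"unknown instruction: {instr!r}")


def run(program: List[Instruction], state: State) -> State:
    """Interpret a straight-line sequence of instructions."""
    for instr in program:
        state = interp_instr(instr, state)
    return state
```

For example, `run([Mov(Reg("eax"), Imm(5)), Add(Reg("eax"), Imm(0xFFFFFFFF))], {})` wraps around modulo 2^32 and leaves the initial (empty) state untouched.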

TSL provides a fixed set of base-types and operators, as well as map-types with map-access and
(applicative) map-update operations. From a TSL specification, the TSL compiler generates a
common intermediate representation (CIR) that allows the meanings of the input-language constructs
to be redefined by supplying alternative interpretations of the base-types, map-types, and the
operations on them (“semantic reinterpretation”). Because all the abstract operations are defined at the
meta-level, semantic reinterpretation is independent of any given instruction set defined in TSL.
Therefore, each implementation of an analysis component’s driver serves as the unchanging driver
for use in different instantiations of the analysis component for different instruction sets. The TSL
language becomes the specification language for retargeting that analysis component to different
instruction sets. Thus, to create M × N analysis components, the TSL system requires only M
specifications of the concrete semantics of instruction sets and N analysis implementations; i.e.,
M + N inputs yield M × N analysis-component implementations.
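
The M + N versus M × N argument can be illustrated with a small sketch. It is written in Python, not TSL, and every name in it is hypothetical: two toy "instruction sets" (`toy_isa_a`, `toy_isa_b`) are each written once against an interpretation interface (`const`, `add`, `sub`), and two interpretations (`Concrete` and an `Interval` abstract domain) are each written once without reference to any instruction set. For simplicity the sketch assumes unbounded integers and ignores machine-word wraparound.

```python
from itertools import product


class Concrete:
    """Concrete interpretation: a value is a plain Python int."""

    @staticmethod
    def const(n):
        return n

    @staticmethod
    def add(a, b):
        return a + b

    @staticmethod
    def sub(a, b):
        return a - b


class Interval:
    """Abstract interpretation: a value is a (lo, hi) pair of ints."""

    @staticmethod
    def const(n):
        return (n, n)

    @staticmethod
    def add(a, b):
        return (a[0] + b[0], a[1] + b[1])

    @staticmethod
    def sub(a, b):
        return (a[0] - b[1], a[1] - b[0])


def toy_isa_a(interp):
    """Hypothetical instruction set A: ("addi", dst, imm) and ("add", dst, src)."""

    def interp_instr(instr, state):
        op, dst, src = instr
        if op == "addi":
            return {**state, dst: interp.add(state[dst], interp.const(src))}
        if op == "add":
            return {**state, dst: interp.add(state[dst], state[src])}
        raise ValueError(op)

    return interp_instr


def toy_isa_b(interp):
    """Hypothetical instruction set B: ("subi", dst, imm) and ("mov", dst, src)."""

    def interp_instr(instr, state):
        op, dst, src = instr
        if op == "subi":
            return {**state, dst: interp.sub(state[dst], interp.const(src))}
        if op == "mov":
            return {**state, dst: state[src]}
        raise ValueError(op)

    return interp_instr


ISAS = [toy_isa_a, toy_isa_b]          # M = 2 semantics specifications
INTERPRETATIONS = [Concrete, Interval]  # N = 2 analysis implementations

# M + N = 4 inputs yield M x N = 4 instruction-level analysis components.
components = {
    (isa.__name__, interp.__name__): isa(interp)
    for isa, interp in product(ISAS, INTERPRETATIONS)
}
```

Here `components[("toy_isa_a", "Interval")]` steps an interval-valued state, while `components[("toy_isa_a", "Concrete")]` steps a concrete one, even though `toy_isa_a` was written only once.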

Applications. As one application of the TSL system, we developed a novel way of applying
semantic reinterpretation to automatically create symbolic-analysis primitives for symbolic
evaluation, weakest-liberal precondition, and symbolic composition (see Chapter 4). Furthermore,
using the TSL system, as well as the TSL-generated symbolic-analysis primitives, we developed a
machine-code verification tool, called MCVETO (§5.1), and a concolic-execution-based
program-exploration tool, called BCE (§5.2).

• MCVETO addresses a large number of issues that arise when developing model-checking
tools for machine code, for which standard techniques used in source-code model-checking
tools would be unsound if applied to machine code. These include (i) the absence of pre-built
control-flow graphs and call graphs; (ii) the absence of metadata, such as information about
variables, types, and aliasing; (iii) the absence of a fixed association between addresses and
instructions (e.g., instruction aliasing and self-modifying code); and (iv) extensive use of
arithmetic on addresses in machine code.

•  What  distinguishes  the  work  on  BCE  is  that  it  makes  use  of  control-dependence  information 
to  make  program  exploration  goal-directed  toward  a  given  set  of  targets. 

The  remainder  of  this  chapter  is  organized  as  follows:  §1.1  discusses  the  challenge  of  software 
defects.  §1.2  discusses  a  few  of  the  approaches  in  the  program-analysis  literature  that  address 
the  problem  of  software  defects.  §1.3  focuses  on  machine-code  analysis  and  the  challenges  in 
implementing  machine-code  analyses.  §1.4  presents  an  overview  of  the  TSL  system.  §1.5  provides 
a  short  description  of  many  of  the  applications  to  which  I  have  applied  TSL.  §1.6  presents  the 
organization  of  the  dissertation  and  the  contribution  of  each  chapter. 

1.1  The  Challenge  of  Software  Defects 

Computers are pervasive in modern life, and are pivotal components in a wide range of
contexts, such as (to name just a few) financial systems, power systems, manufacturing systems,
asset-management systems, health-care systems, and many critical systems (e.g., nuclear
reactors, weapons systems, and aircraft collision-avoidance systems). Computer-safety, reliability, and
security issues have become enormously important because software defects and security
vulnerabilities in computer systems can have severe consequences.

Software defects (bugs) and violations of computer-safety properties can cause critical failures
(e.g., computer-system crashes) or other serious failures in a computer system (e.g., malfunctions
due to mis-computations), which can result in severe damage, both in financial terms and even in
lives lost. We mention just two cases of software bugs that had extreme consequences. In 1996,
problems with a rocket-launch software system caused a rocket that was
set to deliver a payload of satellites into Earth orbit to veer off its path right after launch and
self-destruct. This accident caused a loss of more than $370 million [3]. In one deadly incident in 1994
in Scotland, a system error caused a Chinook helicopter to crash, and all 29 passengers were killed
[1].
Creating a correct and reliable computer system is becoming extremely difficult. As computer
systems become more and more complex, even experienced software developers are prone to
introducing bugs into their products; therefore, the potential for damage of the kind mentioned above
continues to be a serious problem.

A security vulnerability in a software component is a flaw that an adversary can potentially
exploit to create or install malware (such as viruses, worms, Trojans, bots, or back doors)
or spyware, and thereby conduct illegitimate activities, such as damaging and disrupting victims’
computers; stealing confidential information, including passwords, credit-card numbers, and personal
information; destroying important data; and even taking control of a compromised computer.

Annual  worldwide  economic  damages  from  malicious  software  (malware)  exceed  $13  billion 
according  to  a  survey  conducted  by  Computer  Economics  in  2007  (available  in  the  2007  Malware 
Report:  The  Economic  Impact  of  Viruses,  Spyware,  Adware,  Botnets,  and  Other  Malicious  Code 
[2]).  This  amount  includes  only  direct  damages,  such  as  loss  of  revenue  due  to  loss  or  degraded 
performance  of  systems,  labor  costs  to  analyze,  repair  and  cleanse  infected  systems,  and  loss  of 
user productivity. Total damages would be substantially higher if indirect damages were
considered.

Anti-malware technology is fairly effective in defending against many types of malware threats.
However, traditional signature- and heuristic-based anti-malware technologies are often easily
evaded, and are no longer sufficient because there has been a significant increase in the number
of zero-day attacks [22]. Zero-day attacks exploit security vulnerabilities that are unknown to
others (including the original software developer), and for which no security fix is yet available.
Therefore, it has become more important to detect and fix security vulnerabilities before software
products are deployed, or before an adversary can exploit them to attack computer systems.

1.2 Program-Analysis Approaches

Program-analysis technology provides a promising approach to addressing the problems of
finding bugs and security vulnerabilities, and of validating software systems. A considerable
amount of recent research in the programming-languages and software-engineering communities
has led to techniques for (i) finding bugs, (ii) finding security vulnerabilities, and (iii) checking
computer-safety properties. In these tools, program analysis conservatively answers the question
“Can the program reach a bad state?” Although one cannot draw an absolute distinction among
these closely related areas, some of the related work in them can be
summarized as follows:

• Finding bugs/generating test cases. DART (Directed Automated Random Testing) is a tool
for automated testing [93]. To detect bugs that can cause program crashes and assertion
violations, DART uses a combination of concrete execution and symbolic execution to
systematically explore a program’s state space. It uses symbolic execution to find inputs that
direct execution along alternative paths.

CBI (Cooperative Bug Isolation) is a feedback-directed approach to finding bugs [124].
Instrumented applications are deployed to the general public, and statistical methods
are then applied to mine the returned data for information about root causes of failures.

There are several other analysis tools for bug-finding and test-generation [52, 104, 119].

• Finding security vulnerabilities. BOON [181] is a static-analysis technique for determining
whether a C program can index an array outside its bounds. CQual [88] is a type-based
analysis tool that provides a lightweight, practical mechanism for specifying and
checking properties of C programs. It uses type qualifiers to perform taint analysis, and detects
format-string vulnerabilities in C programs. Eau Claire [64] is a tool for finding common
security problems like buffer overflows, file-access race conditions, and format-string bugs. It
uses a theorem prover to create a general specification-checking framework for C programs.
Livshits proposed a static-analysis technique for detecting security vulnerabilities that stem
from unchecked input in Java applications [130]. There are several other analysis tools for
finding security vulnerabilities [37, 122].

• Checking safety properties. Havelund et al. presented a system called Java PathFinder, a
model-checker for Java bytecode programs [100]. Ball et al. developed the Static Driver
Verifier (SDV), which analyzes device-driver source code to determine whether there is a
path in the driver that violates a kernel API usage rule [47]. MOPS uses model-checking
techniques to check certain kinds of security properties, represented as finite-state automata
[63]. There are many other analysis tools in this space [57, 63, 76, 102, 186]. These tools are
based on static analysis, which is used to determine a conservative answer to the question
“Can the program reach a bad state?”

These tools all focus on analyzing source code written in high-level languages, such as C and
Java. However, the problem of analyzing machine code to find bugs and security vulnerabilities,
and  to  recover  other  information  about  their  execution  properties,  has  been  receiving  increased 
attention  for  the  following  reasons: 

•  Computers  do  not  execute  source  code.  Instead,  the  actual  code  that  a  computer  executes 
is  the  machine  code  produced  by  a  compiler  (and  an  optimizer)  from  the  source  code.  In 
the  process  of  compiling  and  optimizing  source  code,  subtle  flaws  depending  on  low-level, 
platform-specific  details,  such  as  memory  layout,  can  be  introduced.  Consequently,  there 
can  be  various  vulnerabilities  that  are  invisible  in  the  original  source  code.  Also,  programs 
may  be  modified  to  insert  malicious  code.  Balakrishnan  et  al.  referred  to  such  a  situation  as 
the  WYSINWYX  phenomenon  (What  You  See  Is  Not  What  You  eXecute)  [38,  39,  41]. 

•  Source  code  is  often  unavailable  to  analyze.  For  instance,  Commercial-Off-The-Shelf 
(COTS)  applications  are  typically  delivered  as  stripped  machine  code  (i.e.,  neither  source 
code nor symbol-table/debugging information is provided). Also, malicious code such as
bots and backdoors is in binary form, and no source code for it is available.

• A program can be written in more than one language, which complicates the lives of
developers of source-level tools. Also, when a program contains inlined assembly code, source-code
analysis typically either ignores that part or does not push the analysis beyond it, which can
make the results of the analysis unsound.

•  Analyses  based  on  source  code  typically  make  unchecked  assumptions,  e.g.,  that  the  program 
is  ANSI-C  compliant.  This  often  means  that  an  analysis  does  not  account  for  behaviors  that 
are  allowed  by  the  compiler  (e.g.,  arithmetic  is  performed  on  pointers  that  are  subsequently 
used  for  indirect  function  calls;  pointers  move  off  the  ends  of  arrays  and  are  subsequently 
dereferenced;  etc.). 
In these situations, the availability of good source-level analysis tools is irrelevant; instead, one
needs tools capable of analyzing machine code.

1.3  Machine-Code  Analysis 

The  aforementioned  issues  that  arise  when  analyzing  source  code  disappear  when  analyzing 
machine  code.  Furthermore,  machine-code  analysis  has  the  advantage  that  it  can  provide  more 
accurate  information  than  a  source-level  analysis  can  because,  for  many  programming  languages, 
certain  behaviors  are  left  under-specified  by  the  semantics.  In  such  cases,  a  source-level  analysis 
must  account  for  all  possible  behaviors,  whereas  an  analysis  of  machine  code  generally  only  has 
to  deal  with  one  possible  behavior,  namely,  the  one  for  the  code  sequence  chosen  by  the  compiler. 
Chapter  2  discusses  machine-code  analysis  in  more  detail. 

There  have  been  several  specialized  analyses  of  machine  code  developed  to  identify  aliasing 
relationships  [80],  data  dependences  [36,  70],  targets  of  indirect  calls  [79],  values  of  strings  [68], 
bounds  on  stack  height  [159],  and  values  of  parameters  and  return  values  [190]. 

In contrast to such specialized analyses, Balakrishnan and Reps [38, 41] developed ways to
address all of these problems by means of an analysis that discovers an over-approximation of
the set of states that can be reached at each point in the executable, where a state means all
of the components of a state: values of registers, flags, and the contents of memory. Moreover,
their approach can be applied to stripped executables (i.e., neither source code nor
symbol-table/debugging information need be available).

Challenges in implementing machine-code analysis. Machine-code analysis presents many
new challenges. For instance, at the machine-code level, memory is one large byte-addressable
array, and an analyzer must handle computed (and possibly non-aligned) addresses. It is crucial
to track array accesses and updates accurately; however, the task is complicated by the fact that
arithmetic and dereferencing operations are both pervasive and inextricably intermingled. For
instance, if local variable x is at offset -12 from the activation record’s frame pointer (register ebp),
an access to x would be turned into an operand [ebp-12]. Evaluating the operand first involves
pointer arithmetic (“ebp-12”) and then dereferencing the computed address (“[·]”). On the other
hand, machine-code analysis also offers new opportunities, in particular, the opportunity to track
low-level, platform-specific details, such as memory-layout effects. Programmers are typically
unaware of such details; however, they are often the source of exploitable security vulnerabilities.
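
The two-step evaluation of an operand such as [ebp-12] can be made concrete with a small sketch. It is illustrative only and rests on stated assumptions: memory is modeled as a Python `bytearray`, registers as a dictionary, values are 32 bits wide and stored little-endian (as on IA32), and the helper name `eval_mem_operand` is hypothetical.

```python
import struct

MASK32 = 0xFFFFFFFF  # assumption: 32-bit address space


def eval_mem_operand(regs, memory, base, displacement):
    """Evaluate a [base + displacement] operand as a 32-bit load."""
    # Step 1: pointer arithmetic on the base register ("ebp-12").
    address = (regs[base] + displacement) & MASK32
    # Step 2: dereference ("[.]"): read four bytes, little-endian,
    # starting at the computed address. No alignment is required.
    (value,) = struct.unpack_from("<I", memory, address)
    return address, value


# Memory is one flat, byte-addressable array.
memory = bytearray(64)
regs = {"ebp": 32}

# Store local variable x at offset -12 from the frame pointer.
struct.pack_into("<I", memory, regs["ebp"] - 12, 0xDEADBEEF)

aligned = eval_mem_operand(regs, memory, "ebp", -12)
# A non-aligned access one byte later reads a value that straddles x.
unaligned = eval_mem_operand(regs, memory, "ebp", -11)
```

The second call shows why an analyzer cannot treat memory as a collection of disjoint variables: the load at offset -11 overlaps three of the four bytes of x.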

Many of the algorithms used in software model checkers that work on source code [47, 49, 102]
would be unsound if applied to machine code. For instance, before starting the verification
process proper, SLAM [47] and BLAST [102] perform flow-insensitive (and optionally field-sensitive)
points-to analysis. However, such analyses often make unsound assumptions, such as assuming
that the result of an arithmetic operation on a pointer always remains inside the pointer’s original
target. Such an approach assumes, without checking, that the program is ANSI C compliant,
and hence causes the model checker to ignore behaviors that are allowed by some compilers (e.g.,
arithmetic is performed on pointers that are subsequently used for indirect function calls; pointers
move off the ends of structs or arrays, and are subsequently dereferenced). A program can use
such features for good reasons (e.g., as a way for a C program to simulate subclassing [172]), but
they can also be a source of bugs and security vulnerabilities.

Although techniques developed in prior work on machine-code analysis are, in principle,
language-independent, they have typically only been instantiated for one instruction set (mostly the
Intel IA32 instruction set). This situation is actually typical of much work on source-code program
analysis, too: even though the techniques described in the literature are, in principle,
language-independent, their implementations are often tied to a specific language or intermediate
representation. This state of affairs reduces the impact that good ideas developed in one context have in
other contexts. The situation is more serious for low-level instruction sets, because (i) instruction
sets usually contain several hundred instructions, and (ii) there are a variety of architecture-specific
features that are incompatible with other architectures.

1.4 Transformer Specification Language (TSL)

To address the issues mentioned above, my work has aimed to provide a systematic way of
implementing analyzers that work on machine code. As part of my research, I developed a language
for specifying the semantics of an instruction set, along with a run-time system to support dynamic
analysis, static analysis, and symbolic analysis of executables written in that instruction set. This
work advances the state of the art because it allows multiple analysis components to be created
automatically from a single specification of the concrete operational semantics of the language to
be analyzed. The system, called TSL (for “Transformer Specification Language”), has two classes
of users: (1) instruction-set-specification (ISS) developers and (2) analysis developers. The
former are involved in specifying the semantics of different instruction sets; the latter are involved in
extending the analysis framework. In designing TSL, we were guided by the following principles:

• There should be a formal language for specifying the semantics of the language to be
analyzed. Moreover, ISS developers should specify only the abstract syntax and a concrete
operational semantics of the language to be analyzed; each analyzer should be generated
automatically from this specification.

•  Concrete  syntactic  issues — including  (i)  decoding  (machine  code  to  abstract  syntax),  (ii) 
encoding  (abstract  syntax  to  machine  code),  (iii)  parsing  assembly  (assembly  code  to  abstract 
syntax),  and  (iv)  assembly  pretty-printing  (abstract  syntax  to  assembly  code) — should  be 
handled  separately  from  the  abstract  syntax  and  concrete  semantics.1 

• There should be a clean interface for analysis developers to specify the abstract semantics
for each analysis. An abstract semantics consists of an interpretation: an abstract domain
and a set of abstract operators (i.e., operators that abstractly interpret the operations of
TSL).

• The abstract semantics for each analysis should be separated from the languages to be
analyzed so that one does not need to specify multiple versions of an abstract semantics for
multiple languages.

Each of these objectives has been achieved in the TSL system: The TSL system translates the TSL
specification of each instruction set to a common intermediate representation (CIR) that can be
used to create multiple analyzers. Each analyzer is specified at the level of the meta-language (i.e.,
by reinterpreting the operations of TSL), which, by extension to TSL expressions and functions,
provides the desired reinterpretation of the instructions of an instruction set.

1 The translation of the concrete syntaxes to and from abstract syntax is handled by a generator tool, called ISAL
(for Instruction Set Architecture Language), which is separate from TSL. ISAL was developed by GrammaTech [13].

[Figure 1.1 diagram. Labels: Client Analyzer; N Analysis Components (interpInstr1, interpInstr2, ..., interpInstrN); TSL System; M Instruction-Set Specifications.]

Figure 1.1 The interaction between the TSL system and a client analyzer. The grey boxes
represent TSL-generated analysis components.


The TSL system provides two dimensions of parameterizability: different instruction sets and
different analyses. Each ISS developer specifies an instruction-set semantics, and each analysis
developer defines an abstract domain for a desired analysis by giving an interpretation (i.e., the
implementations of TSL basetypes, basetype-operators, and map-access/update functions). Given
the inputs from these two classes of users, the TSL system automatically generates an analysis
component. Thus, to create M × N analysis components, the TSL system requires only M
specifications of the concrete semantics of instruction sets, and N analysis implementations (Fig. 1.1);
i.e., M + N inputs are used to obtain M × N analysis-component implementations.
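The M + N versus M × N arithmetic can be illustrated with a minimal Python sketch. Everything in it is hypothetical and is not TSL syntax or TSL's generated API: the two toy "instruction sets" (`spec_a`, `spec_b`), the two interpretations (`Concrete`, `Parity`), and the operator names `const` and `add` are assumptions chosen only to show how each specification is written once against an operator interface and then instantiated with each interpretation.

```python
class Concrete:
    """Concrete interpretation: values are unsigned 32-bit integers."""

    @staticmethod
    def const(n):
        return n & 0xFFFFFFFF

    @staticmethod
    def add(a, b):
        return (a + b) & 0xFFFFFFFF


class Parity:
    """Abstract interpretation: values are 'even', 'odd', or 'T' (unknown)."""

    @staticmethod
    def const(n):
        return "even" if n % 2 == 0 else "odd"

    @staticmethod
    def add(a, b):
        if "T" in (a, b):
            return "T"
        # The low bit of a 32-bit sum depends only on the low bits of the operands.
        return "even" if a == b else "odd"


def spec_a(ops):
    """Toy instruction set A (hypothetical): ADD dst, src."""
    def interp_instr(instr, state):
        op, dst, src = instr
        if op != "ADD":
            raise ValueError(op)
        return {**state, dst: ops.add(state[dst], state[src])}
    return interp_instr


def spec_b(ops):
    """Toy instruction set B (hypothetical): INC dst and CLR dst."""
    def interp_instr(instr, state):
        op, dst = instr
        if op == "INC":
            return {**state, dst: ops.add(state[dst], ops.const(1))}
        if op == "CLR":
            return {**state, dst: ops.const(0)}
        raise ValueError(op)
    return interp_instr


SPECS = {"A": spec_a, "B": spec_b}                          # M = 2 inputs
INTERPRETATIONS = {"concrete": Concrete, "parity": Parity}  # N = 2 inputs

# M x N = 4 analysis components obtained from M + N = 4 hand-written inputs.
COMPONENTS = {
    (isa, name): spec(ops)
    for isa, spec in SPECS.items()
    for name, ops in INTERPRETATIONS.items()
}
```

With only two instruction sets and two interpretations the saving is invisible (4 inputs, 4 components), but adding a third interpretation to `INTERPRETATIONS` yields two further components without touching either specification.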


Many for the price of one! In Fig. 1.1, once one has the N analysis implementations that are the
core of some client analyzer A, one obtains a generator that can create different versions A/M1,
A/M2, ... at the cost of writing specifications of the concrete semantics of instruction sets M1, M2,
etc. Thus, each client analyzer A created using analysis components generated via TSL acts as a
“YACC-like” tool for generating different versions of A automatically.
s1: x = x ⊕ y;
s2: y = x ⊕ y;
s3: x = x ⊕ y;

Figure 1.2 Code fragment that swaps two ints.

1.4.1  Semantic  Reinterpretation 

The  TSL  system  is  based  on  factoring  the  concrete  semantics  of  a  language  into  two  parts:  (i) 
a  client  specification,  and  (ii)  a  semantic  core.  The  interface  to  the  core  consists  of  certain  base 
types,  function  types,  and  operators  (sometimes  called  a  semantic  algebra  [166]),  and  the  client 
is  expressed  in  terms  of  this  interface.  This  organization  permits  the  core  to  be  reinterpreted  to 
produce  an  alternative  semantics  for  the  subject  language.2 

Semantic Reinterpretation for Abstract Interpretation. The idea of exploiting such a factoring
comes from the field of abstract interpretation [73], where factoring-plus-reinterpretation has
been proposed as a convenient tool for formulating abstract interpretations and proving them to be
sound [134, 144, 148]. In particular, soundness of the entire abstract semantics can be established
via purely local soundness arguments for each of the reinterpreted operators.

The  following  example  shows  the  basic  principles  of  semantic  reinterpretation  in  the  context  of 
abstract  interpretation.  We  use  a  simple  language  of  assignments,  and  define  the  concrete  semantics 
and  an  abstract  sign-analysis  semantics  via  semantic  reinterpretation. 

2 Semantic reinterpretation is a program-generation technique, and thus we follow the terminology of the
partial-evaluation literature [108], where the program on which the partial evaluator operates is called the subject
program. In logic and linguistics, the programming language would be called the “object language”. In the compiler
literature, an object program is a machine-code program produced by a compiler, and so we avoid using the term
“object programs” for the programs that TSL operates on.

Example 1.1 [Adapted from [134].] Consider the following fragment of a denotational semantics,
which defines the meaning of assignment statements over variables that hold signed 32-bit int
values (where $\oplus$ denotes exclusive-or):

\[
\begin{array}{l}
I \in Id \qquad E \in Expr ::= I \mid E_1 \oplus E_2 \mid \ldots \\
S \in Stmt ::= I = E; \qquad \sigma \in State = Id \to Int32 \\[6pt]
\mathcal{E} : Expr \to State \to Int32 \\
\mathcal{E}[\![I]\!]\sigma = \sigma I \\
\mathcal{E}[\![E_1 \oplus E_2]\!]\sigma = \mathcal{E}[\![E_1]\!]\sigma \oplus \mathcal{E}[\![E_2]\!]\sigma \\[6pt]
\mathcal{I} : Stmt \to State \to State \\
\mathcal{I}[\![I = E;]\!]\sigma = \sigma[I \mapsto \mathcal{E}[\![E]\!]\sigma]
\end{array}
\]

By “$\sigma[I \mapsto v]$,” we mean the function that acts like $\sigma$ except that argument $I$ is mapped to $v$.
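The specification given above can be factored into client and core specifications by introducing a
domain Val, as well as operators xor, lookup, and store. The client specification is defined by

\[
\begin{array}{l}
xor : Val \to Val \to Val \\
lookup : State \to Id \to Val \\
store : State \to Id \to Val \to State \\[6pt]
\mathcal{E} : Expr \to State \to Val \\
\mathcal{E}[\![I]\!]\sigma = lookup\ \sigma\ I \\
\mathcal{E}[\![E_1 \oplus E_2]\!]\sigma = \mathcal{E}[\![E_1]\!]\sigma\ xor\ \mathcal{E}[\![E_2]\!]\sigma \\[6pt]
\mathcal{I} : Stmt \to State \to State \\
\mathcal{I}[\![I = E;]\!]\sigma = store\ \sigma\ I\ \mathcal{E}[\![E]\!]\sigma
\end{array}
\]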

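For the concrete (or “standard”) semantics, the semantic core is defined by

\[
\begin{array}{l}
v \in Val_{std} = Int32 \\
State_{std} = Id \to Val_{std} \\[6pt]
lookup_{std} = \lambda\sigma.\lambda I.\sigma(I) \\
store_{std} = \lambda\sigma.\lambda I.\lambda v.\sigma[I \mapsto v] \\
xor_{std} = \lambda v_1.\lambda v_2.v_1 \oplus v_2
\end{array}
\]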


Different abstract interpretations can be defined by using the same client semantics, but giving
different interpretations to the base types, function types, and operators of the core. For example,
for sign analysis, assuming that Int32 values are represented in two’s-complement notation, the
semantic core is reinterpreted as follows:3

\[
\begin{array}{l}
v \in Val_{abs} = \{neg, zero, pos\}^{\top} \\
State_{abs} = Id \to Val_{abs} \\
lookup_{abs} = \lambda\sigma.\lambda I.\sigma(I) \\
store_{abs} = \lambda\sigma.\lambda I.\lambda v.\sigma[I \mapsto v] \\
xor_{abs} = \lambda v_1.\lambda v_2.\ \ldots
\end{array}
\]

[Table defining the result of $xor_{abs}$ for each pair $(v_1, v_2)$ of abstract values; footnote 3 gives
the entries for pos and neg.]

For the code fragment shown in Fig. 1.2, which swaps two ints, sign-analysis reinterpretation
creates abstract transformers that, given the initial abstract state $\sigma_0 = \{x \mapsto neg,\ y \mapsto pos\}$,
produce the abstract states shown in Fig. 1.3. □

\[
\begin{array}{lll}
\sigma_0 & := & \{x \mapsto neg,\ y \mapsto pos\} \\
\sigma_1 & := & \mathcal{I}[\![s_1: x = x \oplus y;]\!]\sigma_0 = store_{abs}\ \sigma_0\ x\ (neg\ xor_{abs}\ pos) = \{x \mapsto neg,\ y \mapsto pos\} \\
\sigma_2 & := & \mathcal{I}[\![s_2: y = x \oplus y;]\!]\sigma_1 = store_{abs}\ \sigma_1\ y\ (neg\ xor_{abs}\ pos) = \{x \mapsto neg,\ y \mapsto neg\} \\
\sigma_3 & := & \mathcal{I}[\![s_3: x = x \oplus y;]\!]\sigma_2 = store_{abs}\ \sigma_2\ x\ (neg\ xor_{abs}\ neg) = \{x \mapsto \top,\ y \mapsto neg\}
\end{array}
\]

Figure 1.3 Application of the abstract transformers created by the sign-analysis reinterpretation
to the initial abstract state $\sigma_0 = \{x \mapsto neg,\ y \mapsto pos\}$.
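The factoring in Example 1.1 can also be sketched as executable code. The following Python sketch is an illustration only, not TSL: the names `eval_expr`, `exec_stmt`, `StdCore`, and `SignCore` are hypothetical, statements are encoded as `(identifier, expression)` pairs, and the string `"T"` stands for ⊤. The entries of the abstract xor that involve zero (zero xor v = v) are our own assumption, derived from the concrete meaning of exclusive-or; the entries for pos and neg follow footnote 3.

```python
NEG, ZERO, POS, TOP = "neg", "zero", "pos", "T"


def eval_expr(core, expr, state):
    """Client specification for expressions: E[[I]] and E[[E1 xor E2]]."""
    if isinstance(expr, str):                       # identifier I
        return core.lookup(state, expr)
    op, e1, e2 = expr                               # ("xor", E1, E2)
    if op != "xor":
        raise ValueError(op)
    return core.xor(eval_expr(core, e1, state), eval_expr(core, e2, state))


def exec_stmt(core, stmt, state):
    """Client specification for statements: I[[I = E;]]."""
    ident, expr = stmt
    return core.store(state, ident, eval_expr(core, expr, state))


class StdCore:
    """Concrete semantic core: values are signed 32-bit ints held as Python ints."""
    lookup = staticmethod(lambda state, ident: state[ident])
    store = staticmethod(lambda state, ident, v: {**state, ident: v})
    xor = staticmethod(lambda v1, v2: v1 ^ v2)


class SignCore:
    """Sign-analysis core: values are neg, zero, pos, or T."""
    lookup = staticmethod(lambda state, ident: state[ident])
    store = staticmethod(lambda state, ident, v: {**state, ident: v})

    @staticmethod
    def xor(v1, v2):
        if TOP in (v1, v2):
            return TOP
        if v1 == ZERO:          # assumption: 0 xor v has the sign of v
            return v2
        if v2 == ZERO:
            return v1
        if v1 != v2:            # pos xor neg: the high-order bit is set
            return NEG
        return TOP              # pos xor pos, neg xor neg: zero or positive


def run(core, program, state):
    """Apply the statements of a straight-line program in order."""
    for stmt in program:
        state = exec_stmt(core, stmt, state)
    return state


# The swap fragment of Fig. 1.2.
SWAP = [("x", ("xor", "x", "y")), ("y", ("xor", "x", "y")), ("x", ("xor", "x", "y"))]
```

Running `run(StdCore, SWAP, {"x": -5, "y": 7})` swaps the two values, whereas `run(SignCore, SWAP, {"x": NEG, "y": POS})` reproduces the final abstract state of Fig. 1.3. Only the core changes between the two runs; `eval_expr` and `exec_stmt` are shared.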

3 For the two’s-complement representation, $pos\ xor_{abs}\ neg = neg\ xor_{abs}\ pos = neg$ because, for all combinations
of values represented by pos and neg, the high-order bit of the result is set, which means that the result is always
negative. However, $pos\ xor_{abs}\ pos = neg\ xor_{abs}\ neg = \top$ because the concrete result could be either 0 or positive, and
$zero \sqcup pos = \top$.

Semantic Reinterpretation in TSL. The mapping of a client specification to the operations
of the semantic core that one defines in a semantic reinterpretation resembles a translation to a
common intermediate representation (CIR) data structure. Thus, another approach to obtaining
“systematic” reinterpretations that are similar to semantic reinterpretations (in that they apply to
multiple subject languages) is to translate subject-language programs to a CIR, and then create
various interpreters that implement different abstract interpretations of the node types of the CIR
data structure. Each interpreter can be applied to (the translation of) programs in any subject
language L for which one has defined an L-to-CIR translator. Compared with interpreting objects of
a CIR data type, the advantages of semantic reinterpretation (i.e., reinterpreting the constructs of
the meta-language) are

1.  The  presentation  of  our  ideas  is  simpler  because  one  does  not  have  to  introduce  an  additional 
language  of  trees  for  representing  CIR  objects. 

2.  With  semantic  reinterpretation,  there  is  no  explicit  CIR  data  structure  to  be  interpreted.  In 
essence,  semantic  reinterpretation  removes  a  level  of  interpretation,  and  hence  generated 
analyzers  should  run  faster. 

1.4.2 Technical Contributions Incorporated in the TSL Compilation Process

The  specific  technical  contributions  incorporated  in  the  part  of  the  TSL  compiler  that  generates 
the  CIR  can  be  summarized  as  follows: 

•  Two-Level Semantics: The notion of a two-level intermediate language [149] has been
used to generate the CIR in a way that reduces the loss of precision that could otherwise
come about with certain reinterpretations. To address this issue, the TSL compiler performs
binding-time analysis [108] on the TSL specification to identify which values can always
be treated as concrete values, and which operations should therefore be performed in the
concrete domain (i.e., should not be reinterpreted). §3.2.1 discusses the two-level
intermediate language and binding-time analysis in more detail.

•  Abstract Interpretation: From a specification, the TSL compiler generates a CIR that has the
ability to (i) execute over abstract states, (ii) possibly propagate abstract states to more than
one successor in a conditional expression, (iii) compare abstract states and terminate abstract
execution when a fixed point is reached, and (iv) apply widening operators, if necessary, to
ensure termination. §3.2.2 contains a detailed discussion of these issues.

•  Paired Semantics: The TSL system allows easy instantiations of reduced products by means
of paired semantics. The CIR can be instantiated with a paired semantic domain that couples
two interpretations. Communication between the values carried by the two interpretations
may take place in the TSL base-type operators. §3.2.3 discusses paired semantics in more
detail.
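As an illustration of the paired-semantics idea in the last item, the following Python sketch couples two interpretations of a single operator and lets them communicate inside that operator. It is a sketch under stated assumptions and does not show TSL's actual mechanism: the choice of a sign domain and a parity domain, and the names `sign_xor`, `parity_xor`, `reduce_pair`, and `paired_xor`, are all hypothetical. The string `"T"` denotes an unknown value, and the zero entries of `sign_xor` are our own assumption.

```python
TOP = "T"


def sign_xor(a, b):
    """Sign interpretation of exclusive-or over neg, zero, pos, T."""
    if TOP in (a, b):
        return TOP
    if a == "zero":             # assumption: 0 xor v has the sign of v
        return b
    if b == "zero":
        return a
    return "neg" if a != b else TOP


def parity_xor(a, b):
    """Parity interpretation of exclusive-or over even, odd, T."""
    if TOP in (a, b):
        return TOP
    # The low bit of a xor b is the xor of the low bits.
    return "even" if a == b else "odd"


def reduce_pair(sign, parity):
    """Communication step: the value zero is even, so a zero sign
    refines an unknown parity. (A contradictory pair such as
    (zero, odd) is left untouched in this sketch.)"""
    if sign == "zero" and parity == TOP:
        return sign, "even"
    return sign, parity


def paired_xor(v1, v2):
    """Paired interpretation: run both interpretations, then let them communicate."""
    (s1, p1), (s2, p2) = v1, v2
    return reduce_pair(sign_xor(s1, s2), parity_xor(p1, p2))
```

For example, `paired_xor(("zero", TOP), ("zero", TOP))` yields `("zero", "even")`: the parity interpretation alone would report an unknown parity, and the refinement comes only from the exchange with the sign component.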

1.5 Overview of Applications of the TSL System

The capabilities of the TSL system have been demonstrated by writing specifications for
both the IA32 and PowerPC instruction sets, and then automatically creating a variety of analysis
components from each of the specifications, including dynamic-analysis components,
static-analysis components, and symbolic-analysis components. The
TSL-generated static-analysis components have been used to develop a parameterized version of
CodeSurfer/x86. That is, using TSL, one can create CodeSurfer/M by writing a specification of
the concrete semantics of instruction set M (§1.5.1). The dynamic-analysis and symbolic-analysis
components generated using TSL have been used to develop the semantic primitives (§1.5.2) used
in (parameterized versions of) a model-checking tool for machine code (§1.5.3) and a
concolic-execution-based tool for analyzing bot executables (§1.5.4).

1.5.1  Static-Analysis  Components 

The TSL system has been applied to creating the analysis components employed by
CodeSurfer/x86 [39], which is a static-analysis framework for analyzing stripped x86 executables.
The TSL-generated analysis components include value-set analysis [38, 41], affine-relation
analysis [38], def-use analysis (for memory, registers, and flags), and aggregate structure identification
[42].

•  Value-Set Analysis (VSA). VSA is a combined numeric-analysis and pointer-analysis
algorithm that determines a safe approximation of the set of numeric values and addresses that
each register and memory location holds at each program point [41]. A memory region is an
abstract quantity that represents all runtime activation records of a procedure. To represent a
set of numeric values and addresses, VSA uses value-sets, where a value-set associates each
memory-region with a map from abstract locations to strided intervals. A strided interval
represents a set of numbers with a lower bound, an upper bound, and a stride [160].

•  Affine-Relation Analysis (ARA). An affine relation is a linear-equality constraint between
integer-valued variables. ARA finds all affine relationships that hold in the program, for a
given set of variables. This analysis is used to find induction-variable relationships between
registers and memory locations; these help in increasing the precision of VSA when
interpreting conditional branches [38].

•  Aggregate-Structure Identification (ASI). ASI is a unification-based, flow-insensitive
algorithm to identify the structure of aggregates in a program [42]. For each instruction, the
TSL-generated analysis component generates a set of ASI commands, each of which is
either a command to split a memory region or a command to unify some portions of memory
(and/or some registers). At analysis time, a client analyzer typically applies the generated
ASI-command generator to each of the instructions in the program, and then feeds the
resulting set of ASI commands to an ASI solver to refine the memory regions.

•  Quantifier-Free  Bit-Vector  (QFBV)  semantics.  QFBV  semantics  provides  a  way  to  obtain 
a  symbolic  representation — as  a  formula  in  first-order  quantifier-free  bit-vector  logic — of  an 
instruction’s  semantics. 

•  Def-Use  Analysis  (DUA).  Def-Use  analysis  collects  all  the  definitions  and  uses  of  state 
components  (memory-locations,  registers,  and  flags)  for  each  instruction. 
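To make the strided intervals used by VSA more concrete, the following Python sketch models one as a stride together with a lower and an upper bound, and shows a sound join of two such intervals. It is a minimal sketch, not the CodeSurfer/x86 or TSL implementation: the class name `StridedInterval`, its methods, and the gcd-based join are our own assumptions, 32-bit wraparound is ignored, and each interval is assumed to be well formed (the upper bound is reachable from the lower bound in steps of the stride, and a stride of 0 denotes a single value).

```python
from dataclasses import dataclass
from math import gcd


@dataclass(frozen=True)
class StridedInterval:
    """The set {lb, lb + stride, lb + 2*stride, ..., ub}."""
    stride: int
    lb: int
    ub: int

    def concretize(self):
        """Enumerate the represented set (only sensible for small intervals)."""
        if self.stride == 0:
            return {self.lb}
        return set(range(self.lb, self.ub + 1, self.stride))

    def join(self, other):
        """Over-approximate the union of two strided intervals.

        The new stride must divide both old strides and the distance
        between the two lower bounds, so that every element of either
        operand stays on the new grid.
        """
        lb = min(self.lb, other.lb)
        ub = max(self.ub, other.ub)
        stride = gcd(gcd(self.stride, other.stride), abs(self.lb - other.lb))
        return StridedInterval(stride, lb, ub)
```

For instance, joining the singletons 4 and 8 gives `StridedInterval(4, 4, 8)`, which represents exactly {4, 8}, whereas joining 4[0, 8] with 4[2, 10] gives 2[0, 10], a proper over-approximation of the union.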

These analysis components have been put together to create a system that essentially duplicates
CodeSurfer/x86.

1.5.2  Symbolic-Analysis  Components 


Symbolic analysis has been an effective technique for testing and verifying programs because
of the power that it provides in exploring a program’s state space. The TSL system has been
applied to creating implementations of the basic primitives used in certain kinds of verification
and testing tools that are based on symbolic program analysis. By “symbolic program analysis”,
we mean logic-based techniques to analyze state changes along individual program paths. This is
in contrast to the situation addressed by many abstract-interpretation/dataflow-analysis techniques,
which usually consider the problem of analyzing the effects of a collection of program paths
(e.g., to identify program invariants). The basic primitives used in symbolic analysis are functions
that perform forward symbolic evaluation, weakest precondition, and symbolic composition by
manipulating formulas.

The conventional approach to implementing systems that use symbolic analysis is to write
each of the three symbolic-analysis functions by hand for the programming language of interest.
Our goal was to develop a method to create implementations of symbolic-analysis primitives
easily, so that they can be made available for different subject languages, particularly for different
machine-code instruction sets. Such instruction sets typically have (i) several hundred
instructions, (ii) a variety of architecture-specific features that are incompatible with other architectures,
and (iii) the ability to perform address arithmetic and dereferencing of addresses, which means that
memory states can have complicated aliasing patterns. Consequently, our goal was to generate
implementations of such primitives automatically from a specification of the subject language’s
concrete semantics.

Semantic reinterpretation for symbolic analysis. As a new application for semantic
reinterpretation, we created implementations of the basic primitives used in symbolic program analysis. The
aforementioned techniques and tools in the literature apply symbolic analysis to programs
written in languages with pointers, aliasing, dereferencing, and address arithmetic. We demonstrate
that the reinterpretation technique provides a way to create symbolic-analysis primitives for such
languages.

With TSL, each reinterpretation is defined at the meta-level, by reinterpreting the collection of
TSL base types, function types, and operators. When a reinterpretation is performed in this way, it
is independent of any given subject language. Consequently, with our implementation, all three of
the symbolic-analysis primitives can be generated automatically for every instruction set for which
one has a TSL specification.
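The three primitives named in this section can be illustrated on the small assignment language of Example 1.1. The Python sketch below is hypothetical and deliberately ignores memory, aliasing, and address arithmetic, which are exactly the features that make the machine-code setting hard. Terms are nested tuples, variables are strings, a symbolic state maps each updated variable to a term over the initial-state variables, and the function names `subst`, `sym_eval`, `weakest_pre`, and `compose` are our own.

```python
def subst(term, env):
    """Simultaneously replace each variable in term by env[var];
    variables that env does not mention are left unchanged."""
    if isinstance(term, str):
        return env.get(term, term)
    if isinstance(term, int):
        return term
    op, *args = term
    return (op, *[subst(arg, env) for arg in args])


def sym_eval(stmt, sym_state):
    """Forward symbolic evaluation of `I = E;` from symbolic state sym_state."""
    ident, expr = stmt
    return {**sym_state, ident: subst(expr, sym_state)}


def weakest_pre(stmt, post):
    """Weakest precondition of `I = E;` with respect to formula post:
    substitute E for I in post."""
    ident, expr = stmt
    return subst(post, {ident: expr})


def compose(first, second):
    """Symbolic composition: the transformer for 'first, then second'."""
    result = dict(first)
    for var, term in second.items():
        result[var] = subst(term, first)
    return result


S1 = ("x", ("xor", "x", "y"))    # s1: x = x xor y;
S2 = ("y", ("xor", "x", "y"))    # s2: y = x xor y;
```

Here `sym_eval(S2, sym_eval(S1, {}))` and `compose(sym_eval(S1, {}), sym_eval(S2, {}))` produce the same symbolic state, and `weakest_pre(S1, ("eq", "x", 0))` yields the formula `("eq", ("xor", "x", "y"), 0)`. No simplification of terms is attempted.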

1.5.3 MCVETO: A Refinement-Based Model Checker for Machine Code

We  used  TSL  to  develop  a  model  checker  for  machine  code,  called  MCVETO  (Machine-Code 
VErification  TOol).  MCVETO  uses  directed  proof  generation  [98]  to  find  either  an  input  that 
causes  a  (bad)  target  state  to  be  reached,  or  a  proof  that  the  bad  state  cannot  be  reached.  (The  third 
possibility  is  that  MCVETO  fails  to  terminate.)  What  distinguishes  the  work  on  MCVETO  is  that 
it  addresses  a  large  number  of  issues  that  have  been  ignored  in  previous  work  on  software  model 
checking,  and  would  cause  previous  techniques  to  be  unsound  if  applied  to  machine  code. 

In our implementation, we restricted ourselves to using only language-independent techniques.
In particular, we used a technique for generating automatically some of the key primitives of
MCVETO’s analysis components from descriptions of an instruction set’s syntax and semantics
[125, 126], i.e., (a) an emulator for running tests, (b) a primitive for performing symbolic
execution, and (c) a primitive for the pre-image operator. In addition, we developed
language-independent approaches to the issues discussed above. Consequently, our system acts as a
YACC-like tool for creating versions of MCVETO for different instruction sets: given an instruction-set
description, a version of MCVETO is generated automatically. We created two such instantiations
of MCVETO from descriptions of the Intel x86 and PowerPC instruction sets.

MCVETO  is  described  in  full  detail  in  [174, 175].  §5.1  describes  my  contributions  to  MCVETO. 

1.5.4  BCE:  Analyzing  Bot  Executables 

An  increasing  number  of  computers  have  been  compromised  by  attacks  from  across  the  world 
to  become  part  of  malicious  botnets  [25].  Botnets  seriously  undermine  computer  security  and 
reliability  by  conducting  illegitimate  activities,  such  as  performing  large-scale  distributed  denial- 
of-service  attacks;  identity  theft;  sending  spam,  trojans,  and  phishing  emails;  distributing  pirated 
media; and performing click fraud. Moreover, botnets can quickly grow by using worms to attack
vulnerable systems. During the time between an announcement of a vulnerability and a patch for
the vulnerability, the potential for bot infiltration is particularly high.

The Internet security research community has made significant efforts to identify botnets, to
collect data on their activities, and to develop techniques for detection, mitigation, and disruption.
Some bots try to avoid detection by using slow-spreading infection techniques. Some use multiple
levels of indirection to make it harder to understand the botnet’s structure. There have been
several techniques to detect bots by monitoring network traffic to obtain temporal/spatial behavior
statistics. Network-based and behavior-based approaches have several drawbacks: the approaches
are (i) costly (runtime overhead to monitor network traffic, space overhead for storing packet logs,
etc.), (ii) easily evaded, and (iii) not able to recover the structure of a botnet. Some detection
techniques rely on well-known bot communication signatures: a lot of bot code is reused, and thus
the commands and authentication mechanisms are widely known. However, attackers can easily
modify the command-and-control language used by their bots to raise the bar for detection and
control.

Using  the  TSL  system,  we  have  developed  a  tool  called  BCE  for  extracting  botnet-command 
information  from  bot  executables.  BCE  aims  to  provide  useful  information  from  analysis  of  bot 
executables  by  automatically  extracting  proper  inputs  that  trigger  malicious  behavior.  Applications 
of  the  information  recovered  include  observing  and  analyzing  malicious  behaviors,  as  well  as 
identifying  command  sequences  that  can  be  used  at  either  the  network  or  host  level  to  mitigate 
botnets. 

A typical way to analyze the behavior of a bot is to run the executable and observe its actions.
To carry this out, however, one needs proper inputs that trigger malicious behaviors. Some widely
known commands are often used for this purpose. However, attackers can easily change their
commands to evade such an approach. It is a hard problem to obtain such inputs by manually stepping
through the executable. BCE automates the extraction of information about botnet commands, and
the arguments to commands, by driving the bot executable toward places where system calls are
invoked.


In  §5.2,  we  present  BCE  in  detail. 
1.6 Contributions and Organization of the Dissertation

The  specific  technical  contributions  of  our  work,  along  with  the  organization  of  the  dissertation, 
are  summarized  as  follows: 

In Chapter 2, we begin by discussing the advantages of machine-code analysis over
source-code analysis and the challenges of machine-code analysis. We then introduce CodeSurfer/x86,
followed by an overview of two applications to which I applied CodeSurfer/x86: FFE/x86 and
ConSeq. Lastly, we discuss the motivation for the main contribution of the dissertation, namely
the TSL system.

In Chapter 3, we present the TSL system in detail. TSL is presented from two perspectives:
(i) how to write a TSL specification (from the point of view of instruction-set-specification
developers), and (ii) how to write domains for (re)interpreting the TSL base-types (from the point
of view of analysis developers). We also summarize the applications to which TSL has been
applied, including various static-analysis components that duplicate the hand-written ones used in
CodeSurfer/x86, and discuss the leverage that we obtained through TSL.

In Chapter 4, we discuss the techniques that we developed to automatically create three
symbolic-analysis primitives, and describe how the TSL system was used for that purpose. In
particular, we show how semantic reinterpretation can be applied to create analysis functions that
compute formulas for forward symbolic evaluation, weakest precondition, and symbolic
composition.

In  Chapter  5,  we  present  case  studies,  including  MCVETO,  a  model-checking  tool  for  machine 
code,  which  uses  the  symbolic-analysis  primitives  generated  from  the  TSL  system,  and  BCE,  a 
concolic-execution-based  application,  which  extracts  information  about  botnet  commands  from 
bot  executables. 

We  present  our  conclusions  in  Chapter  6. 
Chapter 2

Machine-Code Analysis

Computers  do  not  execute  source  code;  they  execute  machine  code  generated  from  source 
code  by  the  combined  efforts  of  the  compiler,  the  optimizer,  and  the  linker.  The  compiler  and  the 
optimizer  make  certain  choices  when  generating  machine  code,  depending  on  the  target  platform; 
therefore,  there  can  be  mismatches  in  various  ways  between  what  is  actually  executed  on  the 
processor  and  what  a  programmer  really  intends  in  her  source  code.  Balakrishnan  et  al.  refer 
to  such  a  phenomenon  as  WYSINWYX  (“What  You  See  Is  Not  What  You  eXecute”)  ([41],  [45]  and 
[40,  §1]). 

The  following  example  (obtained  from  [38])  shows  a  security  vulnerability  introduced  due  to 
the  WYSINWYX  phenomenon: 

memset(password, 0, len);
free(password);

The password in clear text is stored in a dynamically-allocated buffer. Because the password is
sensitive information, to minimize the lifetime of the password, the programmer tries to zero-out
the buffer by calling memset before returning it to the heap by calling free. However, the memset
call might be eliminated by a compiler that performs useless-code elimination, based on the
reasoning that the program never uses the values written by the call to that function. Unfortunately, if
this happens, sensitive information would be exposed in the heap.

As the above example illustrates, various vulnerabilities can be introduced by the compiler
and the optimizer due to the idiosyncrasies inherited from a myriad of platform-specific features
and various artifacts of the compiler and optimizer. These include (i) memory-layout details (i.e.,
offsets of variables in the run-time stack activation records and padding between fields of a struct),
(ii) register usage, (iii) execution order, (iv) optimizations, and (v) artifacts of compiler bugs.
Many security exploits make use of such artifacts [105, 135], and thus the target program can be
executed by an attacker so that it operates differently (maliciously) from what is really intended by
the programmer.

Such security vulnerabilities can escape the notice of tools that work on intermediate
representations (IRs) that are built directly from the source code, whereas they are visible to analysis
tools that work on machine code. In addition, there are a number of reasons why analyses based
on source code do not provide the right level of detail for checking certain kinds of properties, and
machine-code analyses do. Moreover, many issues that arise when analyzing source code disappear
when analyzing machine code. Balakrishnan et al. have argued at length, with examples, for
the benefits of analyzing machine code rather than source code in [41], [45], and [40, §1]; we
summarize them in the following list:

•  Source-level tools are only applicable when source code is available, which limits their
usefulness in security applications (e.g., to analyzing code downloaded from the web or
commercial off-the-shelf (COTS) applications, whose source code is usually unavailable).
In particular, source-level tools cannot be applied to analyzing viruses and worms. Most
applications are distributed as executables that have no symbol-table/debugging
information (“stripped executables”). Although symbol-table/debugging information can be used to
adapt source-level analysis techniques to work on machine code (when source code is
unavailable), most analysis techniques are severely hampered when symbol-table/debugging
information is absent.

•  Even if source code is available, as discussed earlier, a substantial amount of information is
hidden from analyses that start from source code, which can cause bugs, security
vulnerabilities, and malicious behavior to be invisible to such tools. Moreover, a source-code tool that
strives to have greater fidelity to the program that is actually executed would have to duplicate
all of the choices made by the compiler and optimizer; such an approach would be extremely
complicated to carry out. An alternative approach would be to use a compiler infrastructure,
such as LLVM [26] or GCC [12], that supports multiple compilers/optimizers. Such an
approach would allow a source-code analysis tool to analyze the effects caused by compiler
artifacts, but only for code created via the compiler infrastructure on which the analyzer is
based. To make an analyzer comprehensive by mimicking multiple compilers/optimizers
would require following such an approach for each possible compiler infrastructure, some
of which are proprietary (e.g., the Microsoft Visual Studio compiler cl). In contrast,
analyzing machine code directly provides a comprehensive solution: each run of the analyzer
would give an answer for the machine-code program to which it is applied, but such an
analyzer can be applied to machine-code programs produced by any compiler infrastructure,
not just a particular one.

•  Programs  are  sometimes  modified  subsequent  to  compilation,  e.g.,  to  perform  optimizations 
or  insert  instrumentation  code  [182]  or  [112,  176].  Such  modifications  are  not  visible  to 
tools  that  analyze  source  code. 

•  Machine-code analysis has an advantage that behavioral models derived from machine code
can be more accurate than models derived from source code (particularly because
compilation, optimization, and link-time transformation can change how the code behaves).
Also, certain choices that the compiler and optimizer make can eliminate some possible
behaviors; hence there is sometimes the opportunity to obtain more precise answers from
machine-code analysis than from source-code analysis.

•  Analyses based on source code typically make (unchecked) assumptions, e.g., that the
program is ANSI-C compliant. This often means that an analysis does not account for
behaviors that are allowed by the compiler (e.g., arithmetic is performed on pointers that are
subsequently used for indirect function calls; pointers move off the ends of arrays and are
subsequently dereferenced; etc.).

• Programs typically make extensive use of libraries, including dynamically-linked libraries
(DLLs), which may not be available in source-code form. Typically, analyses are performed
using code stubs that model the effects of library calls. Because these are created by hand,
they are error-prone, and thus the analysis can return incorrect results. Because library code
can be analyzed directly in machine-code analysis, it is not necessary to rely on potentially-unsound
models of library functions.1

•  The  source  code  may  have  been  written  in  more  than  one  language.  This  complicates  the  life 
of  designers  of  tools  that  analyze  source  code  because  multiple  languages  must  be  supported, 
each  with  its  own  quirks. 

• Even if the source code is primarily written in one high-level language, it may contain inlined
assembly code in selected places. Source-level tools typically either skip over inlined
assembly code [36] or do not push the analysis beyond sites of inlined assembly code [4].
Even if the source code was written in more than one language, a tool that analyzes executables
only needs to support one language. Instructions inserted because of inlined assembly
directives in the source code are visible, and do not need to be treated any differently than
other instructions.

• An additional class of examples for which analysis of an executable can provide more accurate
information than a source-level analysis arises because, for many programming languages,
certain behaviors are left unspecified by the semantics. In such cases, a source-level
analysis must account for all possible behaviors, whereas an analysis of an executable generally
only has to deal with one possible behavior, namely, the one for the code sequence
chosen by the compiler. For instance, in C and C++ the order in which actual parameters
are evaluated is not specified: actuals may be evaluated left-to-right, right-to-left, or in some
other order; a compiler could even use different evaluation orders for different functions.
Different evaluation orders can give rise to different behaviors when actual parameters are
expressions that contain side effects. For a source-level analysis to be sound, at each call
site it must take the union of the descriptors that result from analyzing each permutation of
the actuals. In contrast, an analysis of an executable only needs to analyze the particular
sequence of instructions that lead up to the call.
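
The effect of evaluation order can be made concrete with a small C sketch. It spells out, by hand, the two orders for a hypothetical call sub(next(&c), next(&c)); the names next, sub, call_left_to_right, call_right_to_left, and possible_results are illustrative assumptions and do not come from any tool discussed in this chapter. Because each order is written out explicitly, the sketch itself does not rely on unspecified behavior.

```c
#include <assert.h>

/* A side-effecting "actual parameter": returns the counter, then bumps it. */
int next(int *counter) {
    assert(counter != 0);
    return (*counter)++;
}

int sub(int a, int b) { return a - b; }

/* sub(next(&c), next(&c)) with the first actual evaluated first. */
int call_left_to_right(void) {
    int c = 0;
    int first = next(&c);
    int second = next(&c);
    return sub(first, second);
}

/* The same call with the second actual evaluated first. */
int call_right_to_left(void) {
    int c = 0;
    int second = next(&c);
    int first = next(&c);
    return sub(first, second);
}

/* What a sound source-level analysis has to keep: the result of every
   permutation. Fills out[0..1] and returns the number of distinct results. */
int possible_results(int out[2]) {
    out[0] = call_left_to_right();
    out[1] = call_right_to_left();
    return out[0] == out[1] ? 1 : 2;
}
```

With the counter starting at 0, the left-to-right order yields -1 and the right-to-left order yields 1. A source-level analysis has to retain both results, whereas an executable contains only the one order that the compiler chose.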

1 Machine-code analysis gives platform-specific answers. Models can be beneficial in obtaining answers that apply
to multiple platforms by providing an answer relevant to all library versions that conform to the model.
2.1 Challenges in Machine-Code Analysis

Even though the advantages of analyzing executables are appreciated and well-understood,
because of the obstacles standing in the way of doing a good job of machine-code analysis, there
is a dearth of tools that work on executables directly. Compared with source-code analysis,
analysis of stripped executables presents many challenges and difficulties, including

•  absence  of  information  about  variables:  In  stripped  executables,  no  information  is  provided 
about  the  program’s  global  and  local  variables. 

• a semantics based on a flat memory model: With machine code, there is no notion of separate
“protected” storage areas for the local variables of different procedure invocations, nor any
notion of protected fields of an activation record. For instance, a procedure’s return address
is stored on the stack; an analyzer must prove that it is not corrupted, or discover what new
values it could have.

•  absence  of  type  information:  In  particular,  int-valued  and  address-valued  quantities  are 
indistinguishable  at  runtime. 

• arithmetic on addresses is used extensively: Moreover, numeric and address-dereference operations
are inextricably intertwined, even during simple operations. For instance, consider
the load of a local variable v, located at offset -12 in the current activation record, into register
eax: mov eax, [ebp-12].2 This instruction involves a numeric operation (ebp-12) to
calculate an address whose value is then dereferenced ([ebp-12]) to fetch the value of v,
after which the value is placed in eax.

2 For readers who need a brief introduction to the 32-bit Intel x86 instruction set (also called IA32), it has six 32-bit
general-purpose registers (eax, ebx, ecx, edx, esi, and edi), plus two additional registers: ebp, the frame pointer,
and esp, the stack pointer. By convention, register eax is used to pass back the return value from a function call.
In Intel assembly syntax, the movement of data is from right to left (e.g., mov eax, ecx sets the value of eax to the
value of ecx). Arithmetic and logical instructions are primarily two-address instructions (e.g., add eax, ecx performs
eax := eax + ecx). An operand in square brackets denotes a dereference (e.g., if v is a local variable stored at offset
-12 off the frame pointer, mov [ebp-12], ecx performs v := ecx). Branching is carried out according to the values
of condition codes (“flags”) set by an earlier instruction. For instance, to branch to L1 when eax and ebx are equal,
one performs cmp eax, ebx, which sets ZF (the zero flag) to 1 iff eax - ebx = 0. At a subsequent jump instruction
jz L1, control is transferred to L1 if ZF = 1; otherwise, control falls through.
• instruction aliasing: Programs written in instruction sets with varying-length instructions,
such as x86, can have “hidden” instructions starting at positions that are out-of-registration
with the instruction boundaries of a given reading of an instruction stream [128].

•  self-modifying  code:  With  self-modifying  code  there  is  no  fixed  association  between  an 
address  and  the  instruction  at  that  address. 
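
As a rough illustration of how arithmetic and dereferencing are intertwined in the instruction mov eax, [ebp-12] discussed in the list above, the following C sketch models a machine whose memory is a single flat byte array. The struct machine type, the 256-byte address space, and the helper names are assumptions made for this sketch only; they are not part of any analyzer described here.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_SIZE 256u /* toy address space; an assumption of this sketch */

struct machine {
    uint32_t eax;           /* destination register */
    uint32_t ebp;           /* frame pointer */
    uint8_t  mem[MEM_SIZE]; /* flat memory: no protected per-procedure areas */
};

/* Store a 32-bit value at addr in little-endian byte order, as on x86. */
void store32(struct machine *m, uint32_t addr, uint32_t value) {
    assert(addr <= MEM_SIZE - 4u);
    for (uint32_t i = 0; i < 4u; i++)
        m->mem[addr + i] = (uint8_t)(value >> (8u * i));
}

/* Load a 32-bit little-endian value from addr. */
uint32_t load32(const struct machine *m, uint32_t addr) {
    uint32_t value = 0;
    assert(addr <= MEM_SIZE - 4u);
    for (uint32_t i = 0; i < 4u; i++)
        value |= (uint32_t)m->mem[addr + i] << (8u * i);
    return value;
}

/* Models: mov eax, [ebp-12] */
void mov_eax_from_ebp_minus_12(struct machine *m) {
    uint32_t addr = m->ebp - 12u; /* numeric operation on an address */
    m->eax = load32(m, addr);     /* dereference of the computed address */
}
```

Nothing in this model distinguishes the cell that holds v from any other cell, including the one that would hold a return address. That is the sense in which the memory model is flat: any store whose computed address happens to land on such a cell overwrites it.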

Standard  approaches  to  source-code  analysis  assume  that  certain  information  is  available — or  at 
least  obtainable  by  separate  analysis  phases  with  limited  interactions  between  phases,  e.g., 

•  a  control-flow  graph  (CFG),  or  interprocedural  CFG  (ICFG) 

•  a  call  graph 

•  a  set  of  variables,  split  into  disjoint  sets  of  local  and  global  variables 

•  a  set  of  non-overlapping  procedures 

•  type  information 

•  points-to  information  or  alias  information 

The  availability  of  such  information  permits  the  use  of  techniques  that  can  greatly  aid  the  analysis 
task.  For  instance,  when  one  can  assume  that  (i)  the  program’s  variables  can  be  split  into  (a)  global 
variables  and  (b)  local  variables  that  are  encapsulated  in  a  conceptually  protected  environment, 
and  (ii)  a  procedure’s  return  address  is  never  corrupted,  analyzers  often  tabulate  and  reuse  explicit 
summaries  that  characterize  a  procedure’s  behavior. 

Source-code analysis tools often use separate phases of (i) points-to/alias analysis (analysis of
addresses) and (ii) analysis of arithmetic operations. Because numeric and address-dereference operations
are inextricably intertwined, as discussed above, only very imprecise information would
result if a machine-code analyzer used the same organization of analysis phases. Source-code-analysis
tools sometimes also use questionable techniques, such as interpreting operations in integer
arithmetic, rather than bit-vector arithmetic. They also usually make assumptions about the
semantics that are not true at the machine-code level; for instance, they usually assume that the
area of memory beyond the top-of-stack is not part of the execution state at all (i.e., they adopt the
fiction that such memory does not exist).
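
The difference between integer arithmetic and bit-vector arithmetic can be seen in a few lines of C. The sketch below is only an illustration: the function names are hypothetical, and the "integer" reading is modeled with a wider type so that the result cannot wrap.

```c
#include <assert.h>
#include <stdint.h>

/* x + 1 as a 32-bit machine computes it: arithmetic modulo 2^32. */
uint32_t add1_bitvector(uint32_t x) {
    return x + 1u; /* unsigned overflow wraps around; well defined in C */
}

/* x + 1 under the mathematical-integer reading, modeled with 64 bits. */
uint64_t add1_integer(uint32_t x) {
    uint64_t r = (uint64_t)x + 1u;
    assert(r > x); /* cannot wrap: the sum always fits in 64 bits */
    return r;
}

/* Does "x + 1 > x" hold at the machine level for this particular x? */
int successor_is_larger(uint32_t x) {
    return add1_bitvector(x) > x;
}
```

Under the integer reading, x + 1 > x holds for every x. Under the bit-vector reading it fails for x = 0xFFFFFFFF, because the sum wraps around to 0; an analysis that assumes the former can therefore draw conclusions that do not hold for the executable.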
2.1.1 CodeSurfer/x86

Because  the  problem  of  analyzing  executables  to  recover  information  about  their  execution 
properties  has  been  receiving  increased  attention,  several  techniques  for  analyzing  machine  code 
have  been  developed.  However,  much  of  this  work  has  focused  on  specialized  analyses  to  identify 
aliasing  relationships  [80],  data  dependences  [36, 70],  targets  of  indirect  calls  [79],  values  of  strings 
[68],  bounds  on  stack  height  [159],  and  values  of  parameters  and  return  values  [190]. 

In contrast to such specialized analyses, Balakrishnan and Reps [38, 41] developed ways to
address all of these problems by means of an analysis that discovers an over-approximation of
the set of states that can be reached at each point in the executable, where a state means all of
the state: the values of registers, flags, and the contents of memory. Their techniques have been
incorporated into CodeSurfer/x86 [5].

They have primarily been concerned with the analysis of stripped executables (i.e., neither
source code nor symbol-table/debugging information is available), both because it is the most
challenging situation, and because it is what is needed in the common situation where one needs to
install a device driver or commercial off-the-shelf application delivered as stripped machine code.
If an individual or company wishes to vet such programs for bugs, security vulnerabilities, or malicious
code (e.g., back doors, time bombs, or logic bombs), analysis tools for stripped executables
are required.

Some of the main analyses incorporated into CodeSurfer/x86 can be summarized as follows:

VSA. VSA (Value-Set Analysis) provides useful information about memory accesses in an executable.
VSA is a combined numeric-analysis and pointer-analysis algorithm that determines
a safe approximation of the set of numeric values or addresses that each register and abstract
memory location (a-loc) holds at each program point. In particular, at each program point,
VSA provides information about the contents of registers that appear in an indirect memory
operand; this permits it to determine the addresses that are potentially accessed, which, in
turn, permits it to determine the potential effects on the state of an instruction that contains
an indirect memory operand.

A key feature of VSA is that it tracks integer-valued and address-valued quantities simultaneously.
This is crucial for analyzing executables because numeric values and addresses
are indistinguishable at runtime.

ASI. ASI (Aggregate Structure Identification) [42] is a unification-based, flow-insensitive algorithm
to identify the structure of aggregates in a program. Whenever a read or write to a part of a
memory object is encountered, ASI records how the memory object is to be subdivided into
smaller objects that are consistent with the memory access.
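
The idea of subdividing a memory object according to the accesses made to it can be sketched as follows. This is a deliberately simplified illustration and not the ASI algorithm itself: it omits unification entirely and merely records cut points inside a single object. The struct layout type, the MAX_OBJ bound, and the function names are assumptions made for the sketch.

```c
#include <assert.h>
#include <string.h>

#define MAX_OBJ 64 /* largest object size handled by this toy; an assumption */

struct layout {
    int size;                       /* size of the memory object in bytes */
    unsigned char cut[MAX_OBJ + 1]; /* cut[i] != 0: a boundary at offset i */
};

void layout_init(struct layout *l, int size) {
    assert(size > 0 && size <= MAX_OBJ);
    memset(l->cut, 0, sizeof l->cut);
    l->size = size;
    l->cut[0] = 1;    /* the object starts as one undivided piece */
    l->cut[size] = 1;
}

/* Record a read or write of bytes [offset, offset + len) of the object. */
void layout_access(struct layout *l, int offset, int len) {
    assert(offset >= 0 && len > 0 && offset + len <= l->size);
    l->cut[offset] = 1;
    l->cut[offset + len] = 1;
}

/* Write the start offset of each resulting piece to starts[];
   return the number of pieces. */
int layout_pieces(const struct layout *l, int starts[MAX_OBJ]) {
    int n = 0;
    for (int i = 0; i < l->size; i++)
        if (l->cut[i])
            starts[n++] = i;
    return n;
}
```

For a 12-byte object with a 4-byte access at offset 0 and a 2-byte access at offset 4, the sketch yields three pieces, starting at offsets 0, 4, and 6.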

The remainder of this chapter presents two analyzers that I developed that made use of, and
extended, CodeSurfer/x86. §2.2 describes FFE/x86, which is a static-analysis tool for extracting an
over-approximation of a program’s output data format from an executable. §2.3 describes ConSeq,
which is a consequence-oriented, backward-analysis framework for detecting concurrency bugs.
ConSeq uses backward slicing obtained from CodeSurfer/x86 to identify shared memory reads that
might impact each potential error site. §2.4 discusses the drawbacks of extending CodeSurfer/x86
in this way, and presents the research goals for the work on the TSL system.

2.2 File-Format Extractor (FFE/x86)

Reverse  engineering  helps  one  gain  insight  into  a  program’s  internal  workings.  It  is  often 
performed  to  retrieve  the  source  code  of  a  program  (e.g.,  because  the  source  code  was  lost),  to 
analyze  a  program  that  may  be  malicious  (such  as  a  virus),  to  fix  a  bug,  to  improve  the  performance 
of  a  program,  and  so  forth.  This  section  describes  a  reverse-engineering  tool  that  can  help  a  human 
understand  what  a  program  produces  as  its  output. 

The  technique  presented  in  this  section  promotes  the  reuse  of  components  of  a  tool  chain. 
For  example,  when  a  software  engineer  wants  to  build  a  program  that  can  process  the  files  that 
a  COTS  software  product  generates,  he  can  use  our  tool  to  obtain  information  about  the  format 
specification,  which  would  be  useful  when  creating  a  program  that  can  act  as  a  substitute  consumer 
(or  producer). 
Not all reverse-engineering activities are legal. One of the legal uses of reverse engineering
is to obtain functional specifications needed for interoperability [29];3 hence, the activity that our
tool carries out would generally be considered a legitimate one.

The  technique  presented  here  might  also  be  useful  in  malware  detection.  For  instance,  when 
trying  to  identify  live  versions  of  the  same  malware,  one  would  like  to  have  a  way  to  figure  out  the 
format  of  its  network  traffic.  Our  technique  can  provide  help  with  this  problem. 

Furthermore,  our  technique  can  provide  a  summary  of  a  program’s  behavior:  it  produces  a 
structure  that  consists  of  a  reduced  number  of  entities  (compared  with  the  call  graph  for  instance), 
which  may  make  it  easier  to  understand  what  the  program  is  doing. 

We first construct a hierarchical finite-state machine [16, 34, 35] (HFSM) that represents a
preliminary format structure, as explained in §2.2.3.1. However, an HFSM can be difficult to
understand, so to increase the understandability of the results, we experimented with the application
of several transformations (including simplification and regularization) to create an over-approximation
of the HFSM as an ordinary finite-state machine (FSM), which represents a further
over-approximation of the output data format. This can be used to present the final results either
as an FSM or as a regular expression.

The  contributions  of  the  work  described  in  this  section  are: 

• It provides a technique for extracting an over-approximation of a program’s output data
format, including

-  a  way  to  extract  a  preliminary  structure  for  the  output  data  format  (§2.2.3) 

-  a  way  to  elaborate  the  structure  by  annotating  it  with  information  about  possible  output 
values  and  sizes  (§2.2.4) 

-  a  way  to  simplify  the  structure  to  provide  greater  understanding  of  the  output  data 
format  (§2.2.5) 

This  provides  information  that  can  lead  to  greater  understanding  of  a  program’s  behavior. 

3 When a COTS (Commercial Off-The-Shelf) tool uses a proprietary file format, interoperability can be inhibited:
the tool can only be used in a tool chain with a consumer or producer of files that have that format.
• We report experimental results from applying FFE/x86 on three applications. Our experiments
uncovered a possible bug in png2ico (see [127] for details).

Although we have concentrated on the problem of extracting output file formats from executables,
the same approach could be applied to source code (where one could also take advantage of information
about the program’s variables and their declared types), as well as to extracting input file
formats.

The  remainder  of  this  section  is  organized  as  follows:  §2.2.1  discusses  the  key  observations 
that  inspired  our  work  on  FFE/x86  and  the  assumptions  for  our  approach.  §2.2.3  explains  the 
process  of  constructing  a  structure  for  the  output  data  format,  and  also  provides  an  overview  of 
the  infrastructure  on  which  our  implementation  is  based.  §2.2.4  discusses  how  to  elaborate  the 
structure  generated  from  the  first  step  with  static  analyses.  §2.2.5  presents  a  series  of  filtering 
operations  for  making  HFSMs  more  understandable.  §2.2.6  describes  how  we  validated  FFE/x86. 
§2.2.7  presents  experimental  results.  §2.2.8  describes  related  work.  §2.2.9  describes  possible  future 
directions. 

2.2.1  Programming  Styles 

This  section  makes  a  few  observations  about  programming  styles  used  in  typical  application 
programs  to  produce  output  data. 

Programming  styles  relevant  to  writing  output  data  can  be  categorized  as  individual  writes  and 
bulk  writes.  We  present  different  approaches  tailored  to  handle  them  in  later  sections.  (Some 
programs  use  both  styles;  our  tool  is  capable  of  handling  such  programs,  as  well.) 

Individual writes. The first programming style is to write individual data items out separately
to a file or a network. Standard I/O functions, such as fputs and fputc in C programs, could be
used. In practice, however, wrapper functions tend to be frequently used. Fig. 2.1(a) shows an
example of this programming style using wrapper functions, such as put_byte, put_long, and
write_bytes. Several fields of the output, including magic numbers, types, sizes, and a checksum, are
written out by calling wrapper functions. These functions provide an API to append output items
to an internal buffer; once the whole buffer has been filled, the contents of the buffer are flushed.
Whereas the buffer is written out in bulk, the individual calls to the wrapper functions represent
the “individual writes” referred to in our name for this style. We refer to both the standard I/O
functions and user-defined wrapper functions as output functions.

(a)
[1]  void put_byte(char c) { ... }
[2]  void put_long(long c) { ... }
[3]  void write_bytes(char* c, int n) { ... }
[4]  void type() {
[5]
[6]    switch (...) {
[7]    case 0: put_byte('a'); break;
[8]    case 1: put_byte('b'); break;
[9]    }
[10] }
[11] void chksum() {
[12]   ...
[13]   put_long(...);
[14] }
[15] void fill_data() {
[16]   ...
[17]   while (c) {
[18]     put_byte(c);
[19]   }
[20] }
[21] int main() {
[22]   ...
[23]   put_long(magic1);
[24]   put_long(magic2);
[25]   write_bytes(filename, sizeof(filename));
[26]   type();
[27]   put_long(size);
[28]   chksum();
[29]   return 0;
[30] }

(b)
[1]  struct header {
[2]    byte magic[2];
[3]    char name[100];
[4]    char type;
[5]    long size;
[6]    long chksum;
[7]  };
[8]  void write_file() {
[9]    struct header* h;
[10]   h = (struct header*)malloc(...);
[11]   h->magic[0] = ...;
[12]   strcpy(h->name, ...);
[13]   h->type = ...;
[14]   h->size = ...;
[15]   h->chksum = ...;
[16]   fwrite(h, sizeof(struct header), 1, fp);
[17]   write_data();
[18]   ...
[19] }

Figure 2.1 (a) An example that uses individual writes. (b) An example of a bulk write.
An output operation is an operation relevant to generating an output data object. Specifically,
the term output operation is defined as a call site that calls an output function, either a standard
I/O library function or a wrapper function (see lines 7, 8, 13, 18, 23, 24, 25, and 27 in Fig. 2.1(a)).

Our experience so far is that many application programs are coded in this programming style.
For instance, gzip [15],4 compress95 [6], and png2ico [20] follow such a programming style.

Bulk  writes.  The  second  programming  style  is  to  use  structs  or  classes  to  manipulate  headers. 
Fig.  2.1(b)  shows  an  example  of  using  a  header  structure  to  write  output  data.  A  header  struct 
object  is  created  at  line  10.  Each  field  of  the  struct  is  set  to  some  value  in  lines  11-15.  Finally, 
at lines 16-17, the object is written out to the file in its entirety. In this programming style, calls
like the one to fwrite are the output operations.

In  practice,  we  observed  that  tar  [24]  and  cpio  [8]  use  such  aggregate  structures  as  storage  in 
preparation  for  a  bulk  write.  We  suspect  that  this  style  would  be  used  for  more  than  just  headers 
by  applications  whose  output  files  consist  of  a  sequence  of  records. 

2.2.2  User-Supplied  Information 

In  our  current  implementation,  the  user  must  identify  the  output  functions  and  supply  some 
additional  information  about  them,  in  particular,  information  about  each  output-relevant  parameter: 

•  whether  it  is  a  numeric  value  to  be  written  out 

•  whether  it  is  an  address  pointing  to  the  memory  containing  the  data  to  be  written  out 

•  whether  it  indicates  how  many  bytes  are  written  out 

See §2.2.4.1 for more details. In the case of standard I/O functions, such information is already
known.

4 Because the gzip source uses macros instead of functions, output operations are not call sites in the gzip executable.
This is not compatible with our approach of having the user identify the output operations by supplying
the names of output functions. To convert gzip into an example in which output operations are visible as procedure
calls, so that it could be used for proof of concept in our experimental study, we modified the gzip source code to
change all output macro definitions into explicit functions. Automatically identifying low-level code fragments that
represent output operations remains a challenging problem for future work.
2.2.3 First step

In  our  approach,  a  Hierarchical  Finite  State  Machine  (HFSM)  is  used  to  represent  an  output 
data  format.  An  HFSM  is  a  structure  in  which  nesting  of  finite  automata  within  states  is  allowed 
[34,  35].  An  HFSM  captures  commonalities  by  organizing  states  in  such  a  hierarchy.  Note  the 
following  two  points  about  HFSMs: 

•  The  languages  of  paths  in  recursive  HFSMs  are  exactly  the  context-free  languages. 

•  The  languages  of  paths  in  non-recursive  HFSMs  are  the  regular  languages. 


Figure 2.2 (a) An FSM, (b) A hierarchical FSM.

However, non-recursive hierarchical FSMs can be exponentially more succinct than conventional
FSMs due to sharing, as illustrated in Fig. 2.2.
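
The source of the exponential gap can be seen by counting states in a hypothetical family of non-recursive HFSMs, which is assumed here purely for illustration and is not taken from Fig. 2.2: M_0 consists of a single output state, and each M_k (k >= 1) consists of two call states in sequence, both of which invoke the shared sub-machine M_{k-1}. The function names below are likewise illustrative.

```c
#include <assert.h>

/* States in the hierarchical representation of M_k: one state for M_0
   plus two call states for each of M_1 .. M_k, with sub-machines shared. */
unsigned long hierarchical_states(unsigned k) {
    return 1ul + 2ul * k;
}

/* States after flattening M_k, i.e., after every call state has been
   replaced by a private copy of the machine that it calls. */
unsigned long flattened_states(unsigned k) {
    assert(k < 32); /* keeps the result within the range of unsigned long */
    return k == 0 ? 1ul : 2ul * flattened_states(k - 1);
}
```

For k = 10, the hierarchical form has 21 states, whereas the flattened FSM has 1024: the hierarchical size grows linearly in k, and the flattened size doubles with each level.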

2.2.3.1 Construction of an HFSM

We  will  use  the  code  fragment  shown  in  Fig.  2.1(a)  to  explain  our  approach.  The  code  emulates 
an  archive  utility,  such  as  tar.  It  writes  two  magic  numbers,  followed  by  the  file’s  name,  layout  type, 
size,  and  check-sum,  using  wrapper  functions.  Fig.  2.4  shows  its  disassembled  code  as  generated 
by  IDAPro  [18],  a  commercial  disassembly  toolkit. 

Each procedure involved with at least one output operation gives rise to an FSM. The program’s
wrapper functions include put_byte (sub_401050 in the disassembled code), put_long
(sub_401075), and write_bytes (sub_4010E4), and calls to these functions represent output operations.
FFE/x86 finds the output operations and constructs a hierarchical finite-state machine [16, 34, 35]
(HFSM) based on the control-flow graphs (CFGs) provided by the CodeSurfer/x86 framework
mentioned in the introduction of this chapter [5]. Our analyzer creates a reduced interprocedural
control-flow graph (i.e., the HFSM) that is the projection of the interprocedural control-flow graph
onto enter nodes, exit nodes, call nodes, and output operations.
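
One way to read the projection step is sketched below in C, for a single control-flow graph: a kept node u is connected to a kept node w whenever the original graph has a path from u to w whose intermediate nodes are all dropped. This is a minimal sketch under that reading and is not the FFE/x86 implementation; the adjacency-matrix representation, the MAX_NODES bound, and the names struct graph, walk, and project are assumptions.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES 16 /* capacity of this toy graph; an assumption */

struct graph {
    int n;                                    /* number of nodes in use */
    unsigned char edge[MAX_NODES][MAX_NODES]; /* edge[u][v] != 0: u -> v */
};

/* Depth-first walk from v that passes through dropped nodes only; every
   kept node that is reached becomes a successor of src in the projection. */
static void walk(const struct graph *cfg, const unsigned char keep[],
                 int src, int v, unsigned char seen[], struct graph *out) {
    for (int w = 0; w < cfg->n; w++) {
        if (!cfg->edge[v][w])
            continue;
        if (keep[w]) {
            out->edge[src][w] = 1;
        } else if (!seen[w]) {
            seen[w] = 1;
            walk(cfg, keep, src, w, seen, out);
        }
    }
}

/* Project cfg onto the nodes u with keep[u] != 0 (for example, enter
   nodes, exit nodes, call nodes, and output operations). */
void project(const struct graph *cfg, const unsigned char keep[],
             struct graph *out) {
    assert(cfg->n >= 0 && cfg->n <= MAX_NODES);
    memset(out, 0, sizeof *out);
    out->n = cfg->n;
    for (int u = 0; u < cfg->n; u++) {
        if (!keep[u])
            continue;
        unsigned char seen[MAX_NODES] = {0};
        walk(cfg, keep, u, u, seen, out);
    }
}
```

For example, in a five-node graph with edges 0->1, 1->2, 1->3, 2->3, and 3->4, keeping nodes 0, 2, and 4 produces the projected edges 0->2, 0->4, and 2->4.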

Fig.  2.3  shows  the  outcome  from  running  FFE/x86.  Each  node  in  the  HFSM  is  either  an  output 
operation  (such  as  4011B3)  or  a  call  site  (such  as  4011D6)  to  a  sub-FSM  (such  as  type).  A  call-site 
node,  which  represents  a  call  to  a  sub-FSM,  implicitly  connects  the  two  FSMs  in  the  HFSM. 

The  HFSM  generated  by  our  tool  for  gzip  is  shown  in  Fig.  2.5(a).  Our  thesis  is  that  HFSMs 
(including  elaborations  and  refinements  of  HFSMs,  as  explained  in  §2.2.4  and  §2.2.5)  provide  a 
basis  for  gaining  an  understanding  of  the  program’s  behavior.  In  this  regard,  it  is  instructive  to 
compare  the  HFSM  with  the  program’s  call  graph,  because  a  call  graph  is  another  structure  that  a 
programmer  may  use  to  gain  a  high-level  understanding  of  a  program. 

Fig.  2.5(b)  shows  a  part  of  the  call  graph  for  gzip.  Gzip  is  composed  of  114  control-flow 
graphs  (CFGs),  11491  CFG  nodes,  and  625  call  sites.  Even  though  the  HFSM  produced  by  our 
tool  appears  to  be  quite  complicated,  it  is  substantially  less  complicated  than  both  the  program’s 
call  graph  and  its  interprocedural  control-flow  graph:  the  HFSM  for  gzip  has  12  FSMs,  64  nodes, 
and  36  call  sites. 

2.2.3.2 Existing Infrastructure

FFE/x86  uses  intermediate  representations  (IRs)  provided  by  the  CodeSurfer/x86  framework 
(Fig.  2.6),  which  provides  an  analyst  with  a  powerful  and  flexible  platform  for  investigating  the 
properties  and  behaviors  of  x86  executables  [5].  As  described  in  the  introduction  of  this  chapter, 
CodeSurfer/x86  includes  several  static  analyses,  including  VSA  and  ASI. 

VSA is a combined numeric-analysis and pointer-analysis algorithm that determines an over-approximation
of the set of numeric values and addresses that each memory location holds at each
program point [41]. ASI recovers information about variables and types, especially for aggregates,
including arrays and structs. The variables recovered by ASI are used by VSA to obtain information
about the variables’ possible values. The values recovered by VSA are used by ASI to identify a refined
set of variables. Thus, CodeSurfer/x86 runs VSA and ASI repeatedly, either until quiescence,
or until some user-supplied bound is reached.5

CodeSurfer/x86 uses an initial estimate of the program’s variables, the call graph, and control-flow
graphs (CFGs) for the program’s procedures provided by IDAPro. IDAPro itself does not
identify the targets of all indirect jumps and indirect calls, and therefore the call graph and control-flow
graphs that it constructs are not complete. In contrast, CodeSurfer/x86 uses the values that
VSA discovers to resolve indirect jumps and indirect calls, and thus is able to supply an over-approximation
to the call graph.

§2.2.4 discusses other ways in which VSA and ASI can be exploited for our purposes.

5 If VSA and ASI have not quiesced when the bound is reached, it is still safe to use the results from the final round
of VSA. In particular, each round of VSA provides an over-approximation of the set of numeric values and addresses
for each memory location, modulo the treatment of possible memory-safety violations, some of which may be due to
loss of precision during VSA. See [41] for more details.

Figure 2.3 The HFSM for Fig. 2.1(a). The shaded boxes signify calls to FSMs. Dotted lines
indicate implicit connections between FSMs.

Figure 2.4 The disassembled code for Fig. 2.1(a). Transparent boxes indicate output operations,
and shaded boxes indicate calls to sub-FSMs.

Figure 2.5 (a) The HFSM for gzip. (b) A fragment of the call graph of gzip.

Figure 2.6 Organization of CodeSurfer/x86, and how FFE/x86 interacts with its components.

2.2.4  Augmenting  an  HFSM  with  Information  from  Static  Analyses 

In this section, we explain how to exploit the static analyses mentioned in §2.2.3.2 for elaborating
HFSMs.

2.2.4.1 Value Set Analysis

The HFSM generated by the method described in §2.2.3.1 provides some information for understanding
an output format. The HFSM can be made more precise by annotating it with additional
information. In particular, we wish to label each node with information about:

•  the  size  (in  bytes)  of  the  data  that  the  node  represents,  and 

•  an  over-approximation  of  the  value  written  out. 


(a)
void put_byte(char c) {
  outbuf[outcnt++] = (uch)(c);
  if (outcnt == OUTBUFSIZE)
    flush_outbuf();
}

(b)
mov byte ptr [esp], 1Fh
call put_byte

Figure 2.7 An example code fragment; put_byte is an output function, and call sites that call it
are output operations.


The values of interest are the actual parameters corresponding to the formal parameters of
output functions. For example, suppose that put_byte is one of the output functions (see Fig. 2.7(a)).
Suppose that at one of the call sites that calls put_byte (i.e., at one of the output operations), the
actual parameter is always 1Fh (see Fig. 2.7(b)). This information can be obtained from the
information collected by VSA. Note that at the call on put_byte, the relevant value is stored on the stack
in the byte pointed to by esp. The abstract memory configuration (AMC) that VSA would have for
the call site would indicate this: for instance, Fig. 2.8(a) illustrates the values that the AMC would
contain in this example. In particular, our tool is able to obtain an over-approximation of the set
of values that the actual may hold by evaluating the operand expression [esp] in the AMC, which
amounts to looking up in the AMC the contents of the cell (or cells) that esp may point to. (For
this example, the result would be a singleton set, namely, {1Fh}.)
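
The lookup just described can be illustrated with a small Python sketch. The representation below (one map from registers to the addresses they may hold, one map from addresses to the values a cell may hold), the names AMC and eval_deref, and the address 0x1000 are assumptions made for this example only; they are not the data structures used by VSA or CodeSurfer/x86.

```python
from dataclasses import dataclass, field


@dataclass
class AMC:
    """Toy abstract memory configuration (hypothetical representation)."""
    regs: dict = field(default_factory=dict)  # register -> set of addresses it may hold
    mem: dict = field(default_factory=dict)   # address -> set of values the cell may hold


def eval_deref(amc, reg):
    """Over-approximate the operand [reg].

    Unions the contents of every cell that reg may point to.
    Returns None (standing for Top) when the register or a target cell is unknown.
    """
    if reg not in amc.regs:
        return None
    result = set()
    for addr in amc.regs[reg]:
        if addr not in amc.mem:
            return None  # unknown cell: any value is possible
        result |= amc.mem[addr]
    return result


# Call site of Fig. 2.7(b): mov byte ptr [esp], 1Fh ; call put_byte
# (0x1000 is an arbitrary stand-in for the stack address held in esp.)
amc = AMC(regs={"esp": {0x1000}}, mem={0x1000: {0x1F}})
print(eval_deref(amc, "esp"))
```

When esp may point to several cells, the sketch returns the union of their contents, which mirrors the "cell (or cells)" wording above.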


Figure 2.8 How to obtain information from VSA. Panels: (a) a numeric actual parameter (1Fh) in
the cell that esp points to; (b) an address parameter whose target cells (addresses 1000-1003) hold
value1 through value4; (c) fwrite's BUF_PTR argument, where SIZE x COUNT is the number of bytes
to be written out.


There are two kinds of parameters that can be passed into an output function: numeric values
and addresses.

Numeric  values.  The  case  where  an  actual  parameter  holds  a  numeric  value  has  already  been 
explained above (see Fig. 2.8(a)). The corresponding size of the value can be obtained from ASI,
which  infers  the  size  from  the  usage  pattern  of  the  formal  parameter  in  the  called  function.  (In  the 
case  where  an  output  operation  calls  a  standard  I/O  function,  this  information  is  available  from  the 
signature  of  the  function.)  For  example,  put_byte  would  have  a  1-byte  argument,  put_short  a 
2-byte  argument,  and  so  forth. 

Addresses.  If  the  type  of  a  formal  parameter  is  a  pointer,  the  set  of  addresses  in  the  memory 
location  corresponding  to  the  actual  parameter  would  be  used  to  look  up  in  the  AMC  the  values  in 
the  cells  to  which  the  actual  parameter  could  point  (see  Fig.  2.8(b)). 

The  case  of  fwrite  at  lines  16-17  in  Fig.  2.1(b)  falls  into  this  category.  The  address  of  the 
heap-allocated memory location that contains the data is passed as the first argument.

    size_t fwrite(const void *BUF_PTR, size_t SIZE, size_t COUNT, FILE *FP);

It is known that the product of the second and third parameters of fwrite is the number of bytes
that are written out (see Fig. 2.8(c)).

Value  roles.  The  kind  of  abstract  value  recovered  by  VSA  sometimes  suggests  what  the  value’s 
role  is,  e.g., 

•  Singleton - If VSA recovers a singleton value for an actual parameter of an output
operation, the parameter may correspond to either a magic number or a reserved field.

•  Set of numeric values - If the value that VSA recovers is a non-singleton set of numeric
values, the parameter may correspond to an optional field.

•  Top - If VSA gives Top, which means any value, for an actual parameter of an output
operation, the parameter may correspond to variant data.
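
The three cases above amount to a small classification rule, sketched here in Python. The function name suggest_role, the use of None to stand for Top, and the handling of an empty set are assumptions of this sketch; the role strings only restate the heuristics listed above and are suggestions, not definitive classifications.

```python
TOP = None  # stands for the abstract value Top ("any value")


def suggest_role(value_set):
    """Suggest a likely role for an output parameter from its abstract value."""
    if value_set is TOP:
        return "variant data"
    if len(value_set) == 0:
        # Assumption of this sketch: an empty set means no value reaches the call site.
        return "no value (unreachable call site)"
    if len(value_set) == 1:
        return "magic number or reserved field"
    return "optional field"


for abstract_value in ({0x1F}, {2, 4}, TOP):
    print(abstract_value, "->", suggest_role(abstract_value))
```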

2.2.4.2  Aggregate  Structure  Identification 

As  mentioned  in  §2.2.1,  programmers  frequently  use  a  struct  or  a  class  to  collect  data  before 
it  is  written  out. 

Fig.  2.9  shows  a  fragment  from  ping  [19]  in  which  a  network  packet  is  constructed.  Instead 
of  writing  individual  data  items  one  at  a  time  using  output  operations,  a  struct  object  is  used  to 
store  output  data  while  multiple  fields  are  prepared,  as  shown  in  lines  7-11  of  Fig.  2.9.  Then  the 
aggregate  object  is  written  out  (i.e.,  sent  out)  all  together  on  lines  13-14. 

ASI [155] is a unification-based, flow-insensitive algorithm to identify the structure of
aggregates in a program. Whenever a read or write to a part of a memory object is encountered, ASI
records how the memory object should be subdivided into smaller objects that are consistent with
the memory access.

[1]  u_char outpack[MAXPACKET];
[2]  static void pinger(void) {
[3]      register struct icmphdr *icp;
[4]      register int cc;
[5]      int i;
[6]      icp = (struct icmphdr*)outpack;
[7]      icp->icmp_type = ICMP_ECHO;
[8]      icp->icmp_code = 0;
[9]      icp->icmp_cksum = 0;
[10]     icp->icmp_seq = ntransmitted++;
[11]     icp->icmp_id = ident;
[12]     ...
[13]     i = sendto(s, (char*)outpack, cc, 0, &whereto,
[14]                sizeof(struct sockaddr));
[15]     ...
[16] }

Figure 2.9 Code fragment used to illustrate the use of ASI information.

In this example, we assume that the user has indicated that sendto, which is a GNU C library
function, is the only output function. The second argument of sendto is known to be a pointer to a
struct object with unknown substructure. ASI provides information about this substructure. The
instructions that correspond to the assignment statements at lines 7-11 of Fig. 2.9 are shown in
Fig. 2.10(a) at lines 2, 4, 6, 9, and 13, respectively. VSA provides information about the extent of
memory accessed by each of these instructions. ASI uses that information to subdivide the portion
of memory accessed, thereby producing the structure shown in Fig. 2.10(b). This indicates that the
structure of the packet header may consist of two 1-byte fields, followed by three 2-byte fields.
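
To make the subdivision step concrete, the following Python sketch derives a field layout from the (offset, size) extents of the five stores in Fig. 2.10(a). It is a deliberately simplified stand-in, not the unification-based ASI algorithm: the function name subdivide is hypothetical, and the sketch assumes that accesses either coincide or are disjoint.

```python
def subdivide(accesses):
    """Return the fields of one memory object as sorted (offset, size) pairs.

    accesses: iterable of (offset, size) pairs, one per memory access.
    Overlapping or nested accesses, which real ASI resolves by splitting
    further, are rejected in this sketch.
    """
    fields = sorted(set(accesses))
    for (off1, size1), (off2, _) in zip(fields, fields[1:]):
        if off1 + size1 > off2:
            raise ValueError("overlapping accesses are not handled in this sketch")
    return fields


# Extents of the stores at lines 2, 4, 6, 9, and 13 of Fig. 2.10(a),
# given as offsets relative to the start of outpack.
stores = [(0, 1), (1, 1), (2, 2), (6, 2), (4, 2)]
for offset, size in subdivide(stores):
    print(f"byte_{size} outpack.{offset};")
```

Under these assumptions the sketch yields two 1-byte fields followed by three 2-byte fields, in line with the structure described for Fig. 2.10(b).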

ASI  is  also  capable  of  recovering  information  about  the  structure  of  aggregates  that  are  allocated 
in  the  heap. 

This example illustrates a case where each output function emits a completely constructed
chunk of output data, and the HFSM represents the program’s output operations at a high level
of abstraction. For bulk writes such as the one in this example, the structure information recovered
by ASI can help identify the structure of the output data format.

2.2.5  Filtering 

Because an HFSM can be hard to understand, we experimented with applying a series of
filtering operations (simplification, conversion of each FSM to a regular expression, and
inline expansion) to generate a simpler representation of the output format as a regular expression.
In our experiments, this has been done manually; however, the process would be relatively easy to
automate.

(a)
[1]  mov eax, dword ptr [ebp - 10h]
[2]  mov byte ptr [eax], 8
[3]  mov edx, dword ptr [ebp - 10h]
[4]  mov byte ptr [edx + 1], 0
[5]  mov eax, dword ptr [ebp - 10h]
[6]  mov word ptr [eax + 2], 0
[7]  mov eax, dword ptr [ntransmitted]
[8]  mov edx, dword ptr [ebp - 10h]
[9]  mov word ptr [edx + 6], ax
[10] inc dword ptr [ntransmitted]
[11] mov eax, dword ptr [ident]
[12] mov edx, dword ptr [ebp - 10h]
[13] mov word ptr [edx + 4], ax

(b)
Global:
struct {
    byte_1 outpack.0;
    byte_1 outpack.1;
    byte_2 outpack.2;
    byte_2 outpack.4;
    byte_2 outpack.6;
}

Figure 2.10 (a) The disassembled code fragment for Fig. 2.9, (b) The outcome of ASI.

Simplification.  Not  all  nodes  in  the  HFSM  are  helpful  in  understanding  an  output  format.  An 
unnecessarily  complicated  HFSM  could  prevent  users  from  understanding  key  aspects  of  an  output 
format. 

Most  portions  of  the  HFSM  shown  in  Fig.  2.5(a)  turn  out  to  be  either  Top-value,  Top-size, 
or an unbounded loop that includes them. Top-value means that the node could have any value;
Top-size  means  that  the  node  could  be  of  any  size. 

In each of the following cases, a node (or a node set) would not provide meaningful
information:

•  A node of Top-size and Top-value

•  A node set in an unbounded loop, each node of which has both Top-size and Top-value

To be considered meaningful, a node must have a non-Top size.

Algorithm 1 Simplification algorithm.

Require: HFSM
Ensure: Trimmed HFSM

  Set the status of all FSMs to be meaningful
  while there exists a meaningful FSM M that contains only non-meaningful nodes or calls to
        non-meaningful FSMs do
      Set M to be a non-meaningful FSM
      Transform M into an FSM with a self-loop on a node labeled with (Top-size/Top-value)
  end while


Alg.  1  describes  an  algorithm  for  simplifying  HFSMs  generated  by  FFE/x86.  The  idea  behind 
the  algorithm  is  to  consider  the  cases  mentioned  above:  for  an  FSM  that  consists  of  only  nodes  with 
Top-value  and  Top-size,  or  an  unbounded  loop  that  includes  only  such  items,  it  may  be  better  to 
simplify  it  to  (Top)*  because  the  original  FSM  would  not  provide  much  meaningful  information 
about  the  output  format. 
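
Alg. 1 can be sketched in Python as a fixpoint loop. The encoding below is an assumption made for illustration: each FSM is flattened to a list of items, where an item is either ("node", size, value) or ("call", callee), so the transition structure of the FSM is ignored. The names simplify, TOP, and the "loop" marker are hypothetical and do not come from FFE/x86.

```python
TOP = "Top"


def simplify(hfsm):
    """Mark and trim non-meaningful FSMs, following the shape of Alg. 1.

    hfsm: dict mapping FSM name -> list of ("node", size, value) or ("call", callee).
    Returns (trimmed_hfsm, names_of_non_meaningful_fsms).
    """
    non_meaningful = set()  # initially, every FSM is considered meaningful

    def item_is_meaningful(item):
        if item[0] == "node":
            return item[1] != TOP             # a meaningful node has a non-Top size
        return item[1] not in non_meaningful  # a call is as meaningful as its callee

    changed = True
    while changed:
        changed = False
        for name, items in hfsm.items():
            if name not in non_meaningful and not any(map(item_is_meaningful, items)):
                non_meaningful.add(name)
                changed = True

    # A non-meaningful FSM becomes a self-loop on a (Top-size/Top-value) node.
    trimmed = {
        name: [("loop", ("node", TOP, TOP))] if name in non_meaningful else list(items)
        for name, items in hfsm.items()
    }
    return trimmed, non_meaningful


example = {
    "main": [("node", 1, 0x1F), ("call", "body")],
    "body": [("node", TOP, TOP), ("call", "helper")],
    "helper": [("node", TOP, TOP)],
}
trimmed, dropped = simplify(example)
print(sorted(dropped))
```

In the example, helper is marked first; body then contains only a Top node and a call to a non-meaningful FSM, so it is marked in the next round, while main keeps its 1-byte node.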


Figure 2.11 An example of simplification.


Fig. 2.11 shows an example of simplification. The shaded FSM that contains two
non-meaningful FSMs and three non-meaningful nodes is simplified to an unbounded self-loop
consisting of a node (Top-size/Top-value).

Conversion  to  a  regular  expression.  We  can  convert  each  FSM  in  an  HFSM  into  a  regular 
expression  using  the  Kleene  construction. 
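
As an illustration of this conversion step, here is a minimal Python sketch of Kleene's construction for an FSM with states 0..n-1. The function name fsm_to_regex and the edge encoding (source, symbol, destination) are assumptions of this sketch; it tracks only non-empty paths and adds the empty string at the end, which is one common variant of the construction, and it makes no attempt to keep the resulting expression small.

```python
import re


def _union(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else f"(?:{a}|{b})"


def _concat(a, b):
    return None if a is None or b is None else a + b


def _star(a):
    return "" if a is None else f"(?:{a})*"


def fsm_to_regex(n, edges, start, finals):
    """Return a Python regex (or None for the empty language) for the FSM."""
    # paths[i][j]: regex for the non-empty paths from i to j; None means no path.
    paths = [[None] * n for _ in range(n)]
    for src, symbol, dst in edges:
        paths[src][dst] = _union(paths[src][dst], re.escape(symbol))
    # Allow state k as an intermediate state, one state at a time.
    for k in range(n):
        loop = _star(paths[k][k])
        paths = [
            [_union(paths[i][j], _concat(_concat(paths[i][k], loop), paths[k][j]))
             for j in range(n)]
            for i in range(n)
        ]
    result = None
    for final in sorted(finals):
        result = _union(result, paths[start][final])
    if start in finals:  # the empty string is accepted as well
        result = "" if result is None else f"(?:{result})?"
    return result


# A two-state FSM accepting "a" followed by any number of "b"s.
rx = fsm_to_regex(2, [(0, "a", 1), (1, "b", 1)], start=0, finals={1})
print(rx, bool(re.fullmatch(rx, "abbb")))
```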
Expansion. The final step is to apply inline expansion. Recursion was not encountered in any
of  the  applications  that  we  used  for  our  experiments  (see  §2.2.7),  so  inline  expansion  could  be 
applied  without  worrying  about  non-termination.  If  recursion  had  been  encountered,  we  could 
have  summarized  strongly  connected  components  of  the  call  graph. 
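
Inline expansion over non-recursive FSMs can be sketched as a recursive substitution. The placeholder syntax <call:NAME> and the function name inline are inventions of this sketch, not part of FFE/x86. A recursive call cycle is simply reported as an error here; summarizing strongly connected components, as suggested above, is not attempted.

```python
import re

CALL = re.compile(r"<call:(\w+)>")


def inline(name, regexes, _stack=()):
    """Expand every <call:callee> placeholder in regexes[name].

    regexes: dict mapping FSM name -> regular expression with call placeholders.
    Each callee's expanded expression is wrapped in a non-capturing group.
    """
    if name in _stack:
        raise ValueError(f"recursive call cycle through {name!r}")

    def expand(match):
        return "(?:" + inline(match.group(1), regexes, _stack + (name,)) + ")"

    return CALL.sub(expand, regexes[name])


regexes = {
    "main": r"\x1f\x8b<call:body>",
    "body": r"(?:<call:block>)*",
    "block": r".",
}
expanded = inline("main", regexes)
print(expanded)
```

Because the call graph is acyclic, the recursion terminates after at most one expansion per call chain.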

Fig.  2.12  represents  the  final  outcome  from  using  these  techniques  on  our  example. 


Figure 2.12 The final result after simplification, conversion, and inline expansion.


2.2.6 Validation Against Dynamic Output

We  validated  our  approach  by  testing  whether  the  outcome  from  our  algorithm  (i.e.,  the  regular 
expression)  matches  output  data  produced  during  actual  runs  of  the  application. 

We used flex [11], a tool for generating scanners for compilers. Given an input specification in
the form of a list of pattern-action pairs (where the pattern is a regular expression), flex generates
a program that repeatedly finds the longest prefix of the (remaining) input that matches one of the
patterns. To create a tool for testing whether a regular expression R generated by our algorithm
describes the output of an application, we gave flex a 2-pattern specification consisting of R (with
an action to report success), plus a default pattern (with an action to report failure).

As  discussed  earlier,  each  box  (as  shown  in  Fig.  2.12)  in  the  regular  expression  generated  by 
our  technique  is  labeled  with  two  kinds  of  information:  a  value  and  a  size.  Value  and  size  are 
either  Top,  a  Singleton,  or  a  set  of  numeric  values.  Thus,  to  be  able  to  feed  it  to  flex,  the  regular 
expression needs to be transformed to one in which the basic unit is a 1-byte character. Tab. 2.1
shows  the  transformation  rules  that  are  applied  to  boxes.6 


6  We  use  V  as  a  shorthand  for  “any  character”.  In  flex,  it  is  necessary  to  use  the  pattern  ‘.|\n’. 




Table 2.1 Transformation of boxes.

size         value      conversion
-----------  ---------  ------------------------------------------------------------------
Singleton n  Singleton  According to the value of n, this is split into multiple boxes
                        that each contain a 1-byte value. (E.g., the first box in
                        Fig. 2.13(a) is transformed to the first four boxes in
                        Fig. 2.13(b).)
Singleton n  Top        Top is transformed to V, which matches any character. Thus, this
                        is transformed to a sequence of n boxes that contain V. (E.g.,
                        the fifth box in Fig. 2.13(a) is transformed to the last two
                        boxes in Fig. 2.13(b).)
Top          Top        This is transformed to a box that contains V with a self-loop.
                        (E.g., the third box in Fig. 2.13(a) is transformed to the box
                        that has a loop in Fig. 2.13(b).)

Figure  2.13  An  example  of  the  transformation.  V  means  any  character. 


Tab. 2.1 describes only the cases when size and value have either Singleton or Top. (Note that
there  is  no  case  when  size  is  Top  and  the  value  is  non-Top  because  this  is  not  a  possible  outcome  of 
VSA.)  For  the  case  when  either  size,  value,  or  both  have  a  set  of  numeric  values,  we  split  the  box 
into  multiple  boxes  that  have  a  Singleton  value  and  a  Singleton  size.  For  example,  the  second 
box  in  Fig.  2.13(a),  which  has  two  values  (2  and  4),  is  transformed  to  the  two  boxes  in  Fig.  2.13(b) 
that  have  the  values  2  and  4,  respectively.  For  the  case  where  size  is  not  a  Singleton,  the  shaded 
boxes  in  Fig.  2.13(b)  show  how  it  is  converted. 
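
The transformation rules can be sketched in Python using byte-level regular expressions in place of flex patterns. Several points are assumptions of this sketch rather than facts from the text: multi-byte values are split in little-endian order (the text does not state a byte order), None stands for Top, a box with a set-valued size or value becomes an alternation of Singleton cases, and the example boxes loosely follow the gzip layout of Tab. 2.2 rather than reproducing any figure. The names box_to_regex and format_to_regex are hypothetical.

```python
import re

TOP = None  # stands for Top ("any size" or "any value")


def box_to_regex(sizes, values, byteorder="little"):
    """Translate one box into a bytes regex whose basic unit is one byte.

    sizes, values: TOP or a set of ints.
    """
    if sizes is TOP:
        return b"(?:.)*"  # Top size only occurs with Top value: any byte, repeated
    alternatives = []
    for n in sorted(sizes):
        if values is TOP:
            alternatives.append(b"." * n)  # n boxes that each match any byte
        else:
            alternatives.extend(
                re.escape(v.to_bytes(n, byteorder)) for v in sorted(values)
            )
    return b"(?:" + b"|".join(alternatives) + b")"


def format_to_regex(boxes):
    """Concatenate the per-box regexes; DOTALL lets '.' match every byte value."""
    return re.compile(b"".join(box_to_regex(s, v) for s, v in boxes), re.DOTALL)


# Illustrative boxes: two magic numbers, a constant, seven unconstrained header
# bytes, an unconstrained body, and two 4-byte trailer fields.
boxes = [
    ({1}, {0x1F}), ({1}, {0x8B}), ({1}, {8}),
    ({1}, TOP), ({4}, TOP), ({1}, TOP), ({1}, TOP),
    (TOP, TOP),
    ({4}, TOP), ({4}, TOP),
]
pattern = format_to_regex(boxes)
sample = bytes([0x1F, 0x8B, 8, 0]) + bytes(4) + bytes([0, 3]) + b"payload" + bytes(8)
print(bool(pattern.fullmatch(sample)))
```

A validator in the spirit of the flex specification would report success when fullmatch succeeds on a file's contents and failure otherwise.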

Note that this process is only for validation, because the original values or sets of values are
more likely to be understandable to a human than the subdivided values.

2.2.7 Experimental Results

We evaluated FFE/x86 on three applications: gzip, png2ico, and ping. In this chapter, we
show the results for gzip. All the experimental results are presented in the WCRE’06 paper on
FFE/x86 [127].

Gzip 

Gzip is a GNU data-compression program. Fig. 2.14 represents the outcome after filtering the
HFSM from Fig. 2.5(a).

Figure 2.14 The final result for gzip.


Table 2.2 Part of the specification of gzip’s format [14].

    ID1  ID2  CM  FLG  MTIME  XFL  OS
    If FLG.FHCRC set
    ... compressed blocks ...
    CRC32  ISIZE

    ID1 and ID2  These are the fixed values: ID1=31 (0x1F), ID2=139 (0x8B).
    CM           This identifies the compression method: CM=0-7 are reserved; CM=8 denotes
                 the "deflate" compression method.
    FLG          This is divided into individual bits: bit 0 FTEXT, bit 1 FHCRC, and so forth.
    MTIME        This gives the most recent modification time of the original file being
                 compressed.
    XFL          This is available for use by specific compression methods.
    OS           This identifies the type of file system on which compression took place:
                 0 - FAT filesystem, 1 - Amiga, and so forth.
    CRC32        This contains a cyclic redundancy check value of the uncompressed data.
    ISIZE        This contains the size of the original input data modulo 2^32.

The format of .gz files generated by gzip is described in RFC 1952 (see Tab. 2.2). The
outcome shown in Fig. 2.14 correctly over-approximates the specification. In other words, the
language of the outcome is a superset of the output language of gzip. The outcome has the two
magic numbers (ID1=0x1f and ID2=0x8b) and a constant (CM=8) at the same positions shown in
Tab. 2.2. This is followed by a 4-byte element (corresponding to MTIME) and two 1-byte elements
(corresponding to XFL and OS). At the end, it has two 4-byte elements, which correspond to CRC32
and ISIZE.

We also applied the validation process described in §2.2.6 to this outcome. The flex-generated
validator accepted each of five .gz files (chosen arbitrarily from the Internet).

2.2.8  Related  Work  on  Recovering  Input/Output  Information 

Most  previous  work  on  reverse  engineering  of  file  formats  has  been  dynamic  and  manual. 
Eilam  describes  a  strategy  for  deciphering  file  formats  given  a  symbol  table  and  a  sample  output 
file  [83].  This  approach  requires  manually  stepping  through  disassembled  code  and  inspecting 
memory  contents  in  a  debugger  while  the  program  produces  the  given  file.  Other  approaches 
ignore  the  program  and  rely  on  heuristic  generalization  from  one  or  more  sample  output  files. 
For  example,  one  reverse-engineering  case  study  searched  for  zlib-compressed  data,  file  names, 
length  bytes,  and  other  typical  structures  [10].  All  of  these  approaches  require  considerable  manual 
effort, and one cannot guarantee that the chosen sample files are sufficiently general. In contrast,
the static approach described here over-approximates a file format without relying on sample files,
symbol tables, or extensive manual analysis. Human intervention is only needed to identify output
functions and to assign higher-level interpretations (e.g., “file name”) to selected fields identified
by the analysis.

There have been similar attempts to statically recover information about program data.
Christensen et al. have presented a technique for discovering the possible values of string expressions in
Java  programs  [67].  First,  a  context-free  grammar  is  generated  by  constructing  dependence  graphs 
from  class  files.  The  grammar  is  then  widened  into  a  regular  language,  which  contains  all  possible 
strings  that  could  be  dynamically  generated. 

The  method  of  Christensen  et  al.  has  also  been  applied  to  low-level  code;  Christodorescu  et  al. 
used the method in a string analysis for x86 executables [69]. This approach is similar to ours in the
sense that x86 executables are the targets of both tools, and the recovered output data format in the
analysis is represented as a regular language that denotes a superset of the actual output language.
Their approach, however, is different from ours in the sense that the initial context-free structure
recovered  by  their  tool  comes  from  the  structure  of  operations  purely  internal  to  each  procedure, 
rather  than  from  the  call-return  structure  of  the  program,  as  in  our  tool. 

Our  approach  is  also  related  to  work  on  host-based  intrusion  detection,  in  which  models  of 
expected program behavior are also constructed. The model over-approximates the possible
sequences of system calls, and, by comparing the actual sequence of system calls to those allowed
by  the  model,  is  used  to  detect  when  malicious  input  has  hijacked  the  program.  Pushdown-system 
models  have  been  employed  for  this  purpose,  either  constructed  from  source  code  [179]  or  from 
low-level  code  [91,  92]  (in  particular,  SPARC  executables).  Our  HFSMs  are  similar  in  that  they 
also  yield  context-free  languages  that  are  a  projection  of  a  portion  of  the  program’s  behavior.  We 
have  gone  beyond  previous  work  by  using  the  results  from  two  dataflow  analyses  (namely,  VSA 
and  ASI)  to  elaborate  our  models  with  information  about  possible  sets  of  values  and  value  sizes. 

2.2.9 Discussion of FFE/x86

In  the  work  on  FFE/x86,  we  focus  on  output  operations.  However,  the  same  approach  can  be 
applied to other kinds of operations. For example, one could treat input operations, which are
associated with examining or parsing an input file, using the same approach taken by FFE/x86 [81]. In
this  case,  one  would  want  to  consider  only  paths  to  exit  points  that  represent  successful  runs  of  the 
program  (because  these  correspond  to  successful  uses  of  well-formed  input  files).  In  addition,  one 
could  apply  our  approach  to  network-communication  operations  that  parse  or  construct  packets. 

It  may  be  possible  to  use  such  a  characterization  of  the  input  language  as  a  way  to  generate 
test inputs. Similarly, knowledge of the output language for component c1 in a tool chain could be
used as a source of test inputs for the next component c2 in the chain.

As  mentioned  earlier,  we  assume  that  output  functions  are  identified  by  the  user.  To  create  a 
more  automatic  tool  for  extracting  data  formats,  it  would  be  desirable  to  find  a  way  to  automatically 
identify  output  functions,  especially  wrapper  functions. 

Each loop in an HFSM is currently transformed to either (node-set)* or (node-set)+.
However, there can be cases when the bound on the number of possible iterations of a loop can be
obtained from VSA. In such cases, the information about a loop’s iteration bounds would provide
users with more precise information about the output format.

More  details  can  be  found  in  the  paper  about  FFE/x86  [127]. 

2.3  ConSeq 

CodeSurfer/x86 has also been used as a component of a consequence-oriented backward-analysis
framework, called ConSeq,7 to detect concurrency bugs [191]. This section summarizes
ConSeq,  and  describes  the  component  for  static  slicing  (§2.3.1),  which  was  my  contribution  to  the 
work. 

Concurrency bugs are caused by non-deterministic interleavings between shared memory
accesses. They exist widely (e.g., 20% of driver bugs examined in a previous study [162] are
concurrency bugs) and are among the most difficult bugs to detect and diagnose because interleavings
are not only complicated to reason about, but they also dramatically increase the state space of
software. For large real-world applications, each input easily maps to billions of execution
interleavings, and a concurrency bug may only be exposed by one specific interleaving. How to analyze
this  huge  space  selectively  and  expose  hidden  bugs  is  an  open  problem  for  static  analysis,  model 
checking,  and  software  testing. 

The  effects  of  a  bug  propagate  through  data  and  control  dependences  until  they  cause  software 
to  crash,  hang,  produce  incorrect  output,  etc.  The  lifecycle  of  a  bug  thus  consists  of  three  phases: 
(1)  triggering,  (2)  propagation,  and  (3)  failure.  Traditional  techniques  for  detecting  concurrency 
bugs  mostly  focus  on  phase  (1) — i.e.,  on  finding  certain  structural  patterns  of  interleavings  that  are 
common  triggers  of  concurrency  bugs.  These  patterns  include  data  races  (conflicting  accesses  to  a 
shared  variable)  [66,  87,  147,  163,  189],  simple  atomicity  violations  (unserializable  interleavings 
of two small code regions) [132, 151, 177, 188], context-switch bounded interleavings [56, 121,
142, 143], etc. Although much progress has been made in this direction, those techniques have
fundamental limitations in that they can suffer from false negatives (i.e., many common real-world
concurrency bugs cannot be covered by traditional patterns) and false positives (i.e., the
reported interleavings are not always truly harmful).

7 ConSeq was carried out in collaboration primarily with W. Zhang, S. Lu, and T. Reps, along with
R. Olichandran, J. Scherpelz, and G. Jin. My contribution to the work consisted of the development
of the component for static slicing.

Figure 2.15 The common three-phase error-propagation process for most concurrency bugs
(obtained from [191]).

Consequence-Oriented  Approach 

To improve the accuracy and coverage of state-space search and bug detection, ConSeq is
based on a consequence-oriented approach; that is, it uses a backwards approach, (3)→(2)→(1).
ConSeq’s  backwards  approach  provides  advantages  in  bug-detection  coverage  and  accuracy  but  is 
challenging  to  carry  out.  ConSeq  makes  it  feasible  by  exploiting  the  empirical  observation  that 
phases  (2)  and  (3)  usually  are  short  and  occur  within  one  thread.  ConSeq  uses  potential  software 
failures  to  guide  its  search  of  the  interleaving  space.  Our  approach  can  be  divided  into  the  following 
three  stages: 

Stage  I.  ConSeq  first  statically  identifies  potential  failure  sites  in  an  executable  (i.e.,  it  first 
considers  a  phase  (3)  issue).  This  approach  is  based  on  the  observation  that  concurrency  and 
sequential  bugs  have  drastically  different  causes  but  have  mostly  similar  consequences. 

After  being  triggered  by  an  incorrect  execution  order  across  multiple  threads,  a  concurrency 
bug  usually  propagates  in  one  thread  through  a  short  data/control-dependence  chain,  similar  to 
one  for  a  sequential  bug  [97].  The  erroneous  internal  state  is  propagated  until  an  externally  visible 
failure occurs. At the end, concurrency and sequential bugs are almost indistinguishable: no matter
what the cause, a crash is often preceded by a thread touching an invalid memory location or
violating  an  assertion;  a  hung  thread  is  often  caused  by  an  infinite  loop;  incorrect  outputs  are 
emitted  by  one  thread,  etc. 

ConSeq  statically  identifies  five  types  of  potential  error  sites  that  cover  almost  all  major  types 
of  concurrency  bug  failures  (Stage  I  of  ConSeq  as  shown  in  Fig.  2.15):  (1)  calls  to  assertions  in  the 
software  (for  assertion  crashes);  (2)  back-edges  in  loops  (for  infinite  loop  hangs);  (3)  calls  to  output 
functions (for incorrect functionality failures); (4) calls to error-message functions in the software
(for  various  types  of  internal  errors);  and  (5)  reads  on  global  variables  where  important  invariants 
likely  hold  according  to  Daikon  [85],  a  tool  for  inferring  program  invariants  (for  miscellaneous 
errors  and  failures). 

Stage  II.  ConSeq  then  uses  static  program  slicing  from  CodeSurfer/x86  to  identify  critical 
shared-memory  read  instructions  that  are  highly  likely  to  affect  potential  failure  sites  through  a 
short  chain  of  control  and  data  dependences  (phase  (2))  (Stage  II  of  ConSeq  in  Fig.  2.15). 

ConSeq  exploits  two  characteristics  of  concurrency  bugs:  first,  the  error-propagation  distance 
is usually short in terms of data/control-dependence edges [97] (more information, including
validation of the short-propagation heuristic, can be found in [191]); second, the cause of a
concurrency bug usually involves a specific ordering of just a few (two or three) shared memory
accesses  [56,  131]. 

§2.3.1  presents  the  details  of  Stage  II. 

Stage  III.  Finally,  ConSeq  monitors  a  single  (correct)  execution  of  a  concurrent  program,  and 
by  using  execution-trace  analysis  and  perturbation-based  interleaving  testing,  it  identifies  suspi¬ 
cious  interleavings  that  could  cause  an  incorrect  state  to  arise  at  a  critical  read  and  then  lead  to  a 
software failure (phase (1)) (Stage III of ConSeq in Fig. 2.15).

ConSeq Modules

As  shown  in  Fig.  2.16,  ConSeq  uses  a  combination  of  static  and  dynamic  analyses.  It  uses 
the  following  modules  to  create  an  analyzer  that  works  backwards  along  potential  bug-propagation 
chains. 

Error-site identifier. This static-analysis component processes an executable and identifies
instructions where certain errors might occur. For example, a call to __assert_fail is a potential
assertion-violation failure site. Although currently ConSeq identifies potential error sites for five
types  of  errors,  developers  can  adjust  the  bug-detection  coverage  and  performance  of  ConSeq  by 
specifying  specific  types  of  error  sites  on  which  to  focus. 

Critical-read  identifier.  This  component  uses  static  slicing  to  find  out  which  instructions  that 
read  shared  memory  are  likely  to  impact  a  potential  error  site.  Note  that  static  analysis  is  usually  not 
scalable  for  multi-threaded  C/C++  programs.  By  leveraging  the  short-propagation  characteristic 
of  concurrency  bugs  and  the  staged  design  of  ConSeq,  this  module  is  scalable  to  large  C/C++ 
programs.  (§2.3.1  presents  more  details  of  this  module.) 

Suspicious-interleaving finder. This dynamic-analysis module monitors one run of the
concurrent program, which is usually a correct run, and analyzes what alternative interleavings could
cause a critical read to acquire a different and potentially dangerous value. By leveraging the
characteristics of concurrency bugs’ root causes, this module is effective for large applications. Via
this module, ConSeq generates a bug report, which provides a list of critical reads that can
potentially read dangerous writes and lead to software failures. Critical reads, dangerous writes, and
the  potential  failure  sites  are  represented  by  their  respective  program  counters  in  the  bug  report. 
Additionally,  the  stack  contents  are  provided  to  facilitate  programmers’  understanding  of  the  bug 
report.  [191]  presents  more  details. 

Suspicious-interleaving tester. This module tries out the detected suspicious interleavings by
perturbing the program’s re-execution. It helps expose concurrency bugs and thereby improves
programmers’ confidence in their program. Via this module, ConSeq prunes false positives from
the bug report, and extends the report of each true bug with how to perturb the execution and make
the bug manifest. See [191] for more details.

Figure 2.16 An overview of the ConSeq architecture (obtained from [191]).

2.3.1  Program  Slicing  in  ConSeq 

Program slicing is an operation that identifies semantically meaningful decompositions of
programs, where the decompositions may consist of elements that are not textually contiguous [183].
A backward slice of a program with respect to a set of program elements S consists of all
program elements that might affect (either directly or transitively) the values of the variables used at
members  of  S.  Slicing  is  typically  carried  out  using  program  dependence  graphs  [103]. 


CodeSurfer/x86.  ConSeq  uses  backward  slicing  to  identify  shared  memory  reads  that  might 
impact  each  potential  error  site.  To  obtain  the  backward  slice  for  each  potential  error  site,  it  uses 
CodeSurfer/x86  [39],  which  is  a  static-analysis  framework  for  analyzing  the  properties  of  x86 
executables.  Various  analysis  techniques  are  incorporated  in  CodeSurfer/x86,  including  ones  to 
recover  a  sound  approximation  to  an  executable’s  variables  and  dynamically  allocated  memory 
objects  [41].  CodeSurfer/x86  tracks  the  flow  of  values  through  these  objects,  which  allows  it  to 
provide  information  about  control/data  dependences  transmitted  via  memory  loads  and  stores. 

The goal of the critical-read identification module is to identify critical-read instructions that
are likely to impact potential error sites through data/control dependences. It uses static slicing
to approximate (in reverse) the second propagation phase of a concurrency bug, as shown in Fig. 2.15.
The major design principle of this module is to only report instructions with short propagation
distances as critical reads. Computing the complete program slice, e.g., all the way back to an
input, is complicated and also unnecessary for ConSeq. ConSeq leverages the short-propagation
characteristic of concurrency bugs to improve bug-detection efficiency and accuracy.

Figure 2.17 Static slicing (right) and the distance calculation (left; obtained from [191]).


In accordance with the short-propagation heuristic, ConSeq only reports read instructions
whose return values can affect the error sites through a short sequence of data/control
dependences. Our static-slicing tool provides the slice, together with the value of the shortest
distance to the starting point of the slice, for each instruction of the slice. An example is shown
in Fig. 2.17. ConSeq provides a tunable threshold MaxDistance for users to control the balance
between false negatives and false positives. By default, ConSeq uses 4 as MaxDistance. A detailed
evaluation is presented in [191].
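
The distance-bounded slice described above can be sketched as a breadth-first walk over reversed
dependence edges. This is only an illustration under stated assumptions: the graph encoding and the
names DependenceGraph, BoundedBackwardSlice, and CriticalReads are hypothetical and are not part
of ConSeq or CodeSurfer/x86, which operate on dependence graphs recovered from executables.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <vector>

// Hypothetical encoding: preds[n] lists the nodes that node n directly
// depends on (via a data or control dependence).
using DependenceGraph = std::vector<std::vector<std::size_t>>;

// Maps every node within max_distance backward dependence edges of
// error_site to the length of its shortest dependence chain to the site.
std::map<std::size_t, int> BoundedBackwardSlice(const DependenceGraph& preds,
                                                std::size_t error_site,
                                                int max_distance) {
  assert(error_site < preds.size() && max_distance >= 0);
  std::map<std::size_t, int> distance{{error_site, 0}};
  std::queue<std::size_t> worklist;
  worklist.push(error_site);
  while (!worklist.empty()) {
    const std::size_t node = worklist.front();
    worklist.pop();
    const int d = distance[node];
    if (d == max_distance) continue;  // do not walk past the threshold
    for (std::size_t pred : preds[node]) {
      // Breadth-first order reaches each node first along a shortest chain.
      if (distance.emplace(pred, d + 1).second) worklist.push(pred);
    }
  }
  return distance;
}

// Keeps only the slice members that are flagged as shared-memory reads.
std::vector<std::size_t> CriticalReads(
    const std::map<std::size_t, int>& slice,
    const std::vector<bool>& is_shared_read) {
  std::vector<std::size_t> reads;
  for (const auto& entry : slice) {
    if (is_shared_read[entry.first]) reads.push_back(entry.first);
  }
  return reads;
}
```

With max_distance set to 4 (the default MaxDistance mentioned above), a read that is five
dependence edges away from the error site is left out of the report, which is how the threshold
trades false negatives against false positives in this sketch.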


Side-stepping scalability problems. To avoid the possible scalability problems that can occur
with CodeSurfer/x86 due to the size of the applications used in evaluating ConSeq, we set the
starting point of each analysis in CodeSurfer/x86 to the entry point of the function to which a
given potential error site belongs, instead of the main entry point of the program. By doing so,
CodeSurfer/x86 only needs to analyze the functions of interest and their transitive calls rather
than the whole executable. Thus the static-analysis time grows roughly linearly in the number of
functions that contain error sites. This approach makes ConSeq much more scalable, as illustrated
in the experimental section of [191].

This approach is applicable in ConSeq because, based on the observation that the
error-propagation distance is usually short, ConSeq only requires a short backward slice that can
be covered in one procedure. The backward-slicing and other analysis operations in CodeSurfer/x86
are, however, still context-sensitive and interprocedural [103]. Moreover, to obtain better
precision from slices, each of the analyses used by CodeSurfer/x86 is also performed
interprocedurally: calls to a sub-procedure are analyzed with the (abstract) arguments that arise
at the call-site; calls are not treated as setting all the program elements to ⊤.

Analysis Accuracy. To obtain static-analysis results that over-approximate what can occur in any
execution run, all the program elements (memory, registers, and flags) in the initial state with
which each analysis starts are initialized to ⊤, which represents any value. Such an approximation
makes sure that no critical read will be missed by ConSeq at runtime. Of course, some instructions
could be mistakenly included in the backward slice and be wrongly treated as critical reads.
Fortunately, our short-propagation-distance heuristic minimizes the negative impact of
over-approximation. In practice, we seldom observe any inaccuracy caused by this
over-approximation.

Identifying Potential Infinite Loops. For non-deadlock bugs, infinite loops in one thread are the
main causes of hangs. Every back-edge in a loop is a potential site for this type of failure.
ConSeq identifies strongly connected components (SCCs) that are potential failure sites for
infinite-loop hangs by checking whether any shared-memory read is included in the backward slice of
each back-edge in an SCC. To identify nested loops, CodeSurfer/x86 implements Bourdoncle’s
algorithm [53], which recursively decomposes an SCC into sub-SCCs.
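
As a rough illustration of the SCC-based step, the sketch below finds the SCCs of a control-flow
graph that contain a cycle and then tests a per-node flag. It uses Tarjan's algorithm for a single
level of SCCs, not Bourdoncle's recursive decomposition, and the flag depends_on_shared_read is
a hypothetical stand-in for the backward-slice check on back-edges; the names Graph, LoopSccs, and
IsHangCandidate are likewise assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical encoding: succs[n] lists the control-flow successors of n.
using Graph = std::vector<std::vector<std::size_t>>;

// Tarjan's algorithm. Returns only the SCCs that contain a cycle, i.e.,
// SCCs with more than one node or with a single self-looping node.
std::vector<std::vector<std::size_t>> LoopSccs(const Graph& succs) {
  const std::size_t n = succs.size();
  const std::size_t kUnvisited = n;  // sentinel: valid indices are 0..n-1
  std::vector<std::size_t> index(n, kUnvisited), low(n, 0), stack;
  std::vector<bool> on_stack(n, false);
  std::vector<std::vector<std::size_t>> result;
  std::size_t next_index = 0;

  std::function<void(std::size_t)> visit = [&](std::size_t v) {
    index[v] = low[v] = next_index++;
    stack.push_back(v);
    on_stack[v] = true;
    for (std::size_t w : succs[v]) {
      if (index[w] == kUnvisited) {
        visit(w);
        low[v] = std::min(low[v], low[w]);
      } else if (on_stack[w]) {
        low[v] = std::min(low[v], index[w]);
      }
    }
    if (low[v] != index[v]) return;  // v is not the root of an SCC
    std::vector<std::size_t> scc;
    std::size_t w;
    do {  // pop the members of the SCC rooted at v
      w = stack.back();
      stack.pop_back();
      on_stack[w] = false;
      scc.push_back(w);
    } while (w != v);
    const bool self_loop =
        std::find(succs[v].begin(), succs[v].end(), v) != succs[v].end();
    if (scc.size() > 1 || self_loop) result.push_back(scc);
  };

  for (std::size_t v = 0; v < n; ++v) {
    if (index[v] == kUnvisited) visit(v);
  }
  return result;
}

// An SCC is reported as a potential hang site if any of its nodes is
// flagged as depending on a shared-memory read.
bool IsHangCandidate(const std::vector<std::size_t>& scc,
                     const std::vector<bool>& depends_on_shared_read) {
  return std::any_of(scc.begin(), scc.end(), [&](std::size_t v) {
    assert(v < depends_on_shared_read.size());
    return depends_on_shared_read[v];
  });
}
```

Handling nested loops would require re-running the decomposition inside each SCC after removing
its head node, which is the part that Bourdoncle's algorithm adds and that this sketch omits.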

More False-Positive Pruning via Symbolic Execution. The precision loss inherent in static
analysis can result in spurious backward slices, which can cause false positives in ConSeq.
To prune slices that are likely to be spurious, we introduce a heuristic based on symbolic
execution, which tracks symbolic expressions rather than actual values [62]. A symbolic execution
is done by replaying a concrete trace produced from PIN [133], but executing it symbolically. Each
trace must contain (i) a possibly false-positive critical read $I$ and (ii) the control point $B$
(a conditional branch instruction) that controls execution of an error site. Two separate symbolic
executions are performed for pruning: one with $I$ ($SE_1$), and the other without $I$ ($SE_2$).
Each of the program elements is initialized to a symbol instead of a concrete value in the initial
symbolic state with which each symbolic execution starts. We obtain the branching constraint $C_1$
from $SE_1$, and the second constraint $C_2$ from $SE_2$. If the following formula always holds,
we can determine that $I$ is a false positive (i.e., $I$ does not impact the control toward the
error site):

$$C_1 \Leftrightarrow C_2.$$

Due to the complexity of validity checking, we use the following formula as a heuristic:

$$S_1 \models C_2 \;\wedge\; S_2 \models C_1$$

where $S_1$ and $S_2$ are satisfying assignments obtained using the YICES SMT solver for $C_1$ and
$C_2$, respectively.⁸
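
The heuristic check can be illustrated with a toy stand-in for the solver. In the sketch below, a
constraint is a predicate over a single 8-bit symbolic input and a brute-force search plays the
role that the YICES SMT solver plays in ConSeq; the names Constraint, FindModel, and
LikelyFalsePositive are hypothetical, as is the choice to keep the report when either constraint
has no satisfying assignment.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// A branching constraint over one symbolic 8-bit input (a deliberately tiny
// stand-in for the formulas that an SMT solver would handle).
using Constraint = std::function<bool(std::uint8_t)>;

// Brute-force "solver": stores some satisfying assignment in *model and
// returns true, or returns false if the constraint is unsatisfiable.
bool FindModel(const Constraint& c, std::uint8_t* model) {
  assert(model != nullptr);
  for (int x = 0; x <= 255; ++x) {
    const std::uint8_t candidate = static_cast<std::uint8_t>(x);
    if (c(candidate)) {
      *model = candidate;
      return true;
    }
  }
  return false;
}

// Heuristic from the text: S1 |= C2 and S2 |= C1, where S1 and S2 are
// satisfying assignments of C1 and C2. Returns true if the critical read
// would be pruned as a likely false positive.
bool LikelyFalsePositive(const Constraint& c1, const Constraint& c2) {
  std::uint8_t s1 = 0, s2 = 0;
  // Assumption of this sketch: keep the report if either is unsatisfiable.
  if (!FindModel(c1, &s1) || !FindModel(c2, &s2)) return false;
  return c2(s1) && c1(s2);
}
```

Because only one assignment per constraint is tested, the check can accept constraint pairs that
are not logically equivalent, which is why the text presents it as a heuristic rather than as a
validity check.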

2.3.2  Evaluation 

The evaluation of ConSeq on large, real-world C/C++ applications shows that ConSeq detects
more bugs than traditional approaches and has a much lower false-positive rate [191]. ConSeq was
evaluated on 11 real-world concurrency bugs in seven widely used C/C++ open-source server and
client applications (Mozilla, MySQL, Cherokee, Transmission, Aget, etc.). ConSeq was able to
detect 10 out of 11 tested concurrency bugs, which cover a wide range of root causes, from simple
races and single-variable atomicity-violations to order-violations, anti-atomicity violation bugs,
multi-variable synchronization problems, etc. For comparison, we evaluated a race detector and
an atomicity-violation detector and found that they could only detect 3 and 4 bugs, respectively.
ConSeq detected these bugs with high accuracy: it had about one-tenth the false-positive rate of
the race detector and the atomicity-violation detector.

⁸ For the implementation of this particular part, we use symbol-analysis primitives
(symbolic-execution primitive and satisfaction relation) created by TSL [126, 125], which is the
main subject of this thesis. TSL will be presented in the following chapters.

ConSeq also found 2 new bugs in Aget, 2 new bugs in Click, and one output non-determinism
in Cherokee, for which bugs had not been previously reported. ConSeq found a known infinite-loop
bug in a version of MySQL for which the bug had not been previously reported. Experiments
in which we used ConSeq together with Daikon [85] show that ConSeq can detect complicated
concurrency bugs that previous tools cannot (e.g., a bug involving 11 threads and 21 shared
variables). The performance of ConSeq is suitable for in-house testing.

More details of the experimental results are presented in the ASPLOS’11 paper about ConSeq
[191].

2.3.3  Discussion  of  ConSeq 

The work on ConSeq provides a new perspective on concurrency-bug detection and testing,
which is to start from potential consequences and work backwards. It provides alternative
interpretations for some concurrency bugs with complicated causes that are difficult to detect
using traditional approaches, and sets up a nice connection with sequential bug-detection research,
such as Daikon [85].

ConSeq uses a three-stage bug-detection framework that leverages characteristics from all
three phases of the concurrency-bug propagation process. The design separates the complexity
of inter-thread interleaving analysis and intra-thread propagation analysis, and makes it easy to
leverage advanced static-analysis techniques, such as slicing and loop analysis. Each stage of the
framework can be easily extended. In particular, programmers can assist ConSeq by putting more
consistency checks into their code, such as assertions and error messages.

Overall, ConSeq effectively exposes those non-determinisms among a small number of shared
memory accesses that can propagate a relatively short distance, cause a common error (such as an
infinite loop, a fired error message, or an assertion failure), and end up with a visible failure.

2.4  Motivation  for  a  New  System  for  Implementing  Machine-Code  Analyses

Although the analysis techniques incorporated into CodeSurfer/x86 are, in principle,
language-independent, the original implementation was tied to the Intel IA32 instruction set.
Moreover, CodeSurfer/x86 incorporated at least eight separate analyses, each of which was an
independently coded abstract interpretation of the IA32 instruction set’s concrete semantics.
Fig. 2.18 shows some simplified versions of the implementations of VSA (on the left) and ASI (on
the right) in CodeSurfer/x86. The implementation of the abstract transformer for each analysis
usually has a big switch statement in which, for each IA32 instruction, an abstract transformer is
implemented in the analysis’s abstract domain according to the concrete semantics of the
instruction. The switch statement for each analysis in CodeSurfer/x86 contains about 110 cases⁹
for frequently used IA32 instructions. If one wanted to develop, e.g., CodeSurfer/PowerPC,
substantial work would be necessary to port the original CodeSurfer/x86 implementation to support
a new instruction set. In particular, CodeSurfer/x86 consists of eight analyses, and an abstract
transformer for each instruction of PowerPC would need to be implemented for each of the eight
analyses’ abstract domains.

In general, if one has N subject languages and a desired tool that consists of M analysis
components, one would have to create N × M analysis-component implementations. (One of the
advantages of the TSL system is that to obtain the desired N × M analysis-component
implementations, a human tool designer will only have to perform N + M work.)

The  situation  described  above  is  fairly  typical  of  much  work  on  program  analysis:  although  the 
techniques  described  in  the  literature  are,  in  principle,  language-independent,  implementations  are 
often  tied  to  a  specific  language  or  intermediate  representation  (IR).  Retargeting  them  to  another 
language  can  be  an  expensive  and  error-prone  process.  Even  for  source-code  analysis,  this  state  of 
affairs  reduces  the  impact  that  good  ideas  developed  in  one  context  (e.g.,  Java  program  analysis) 
have  in  other  contexts  (e.g.,  C++  analysis). 

For  high-level  languages,  the  situation  has  been  addressed  by  developing  common  intermediate 
languages,  e.g.,  GCC’s  RTL,  Microsoft’s  MSIL,  etc.  (although  the  academic  research  community 

⁹ The remaining instructions out of about 600 IA32 non-floating-point/non-MMX instructions are
treated as causing the resultant state to be Top.
VSA_state_t VsaTransformerForIA32(              set_of_mini_asi_instr AsiTransformerForIA32(
    Instr i, VSA_state_t S)                         Instr i, VSA_state_t S)
{                                               {
  VSA_state_t ans;                                set_of_mini_asi_instr ans;
  switch(i.id) {                                  switch(i.id) {
    case IA32_MOV: {                                case IA32_MOV: {
      VSA_value_t v = EvalVSA(i.child2, S);           set_of_mini_asi_instr v1 =
      ans = UpdateVSAState(S, i.child1, v);               CollectMemAccesses(i.child1, S);
      break;                                          set_of_mini_asi_instr v2 =
    }                                                     CollectMemAccesses(i.child2, S);
    case IA32_ADD: {                                  ans = v1.union(v2);
      VSA_value_t v1 = EvalVSA(i.child1, S);          break;
      VSA_value_t v2 = EvalVSA(i.child2, S);        }
      VSA_value_t v = VSAPlus(v1, v2);              case IA32_ADD: {
      ans = UpdateVSAState(S, i.child1, v);           ...
      break;                                          break;
    }                                               }
    case IA32_SUB: {                                case IA32_SUB: {
      ...                                             ...
      break;                                          break;
    }                                               }
  }                                               }
  return ans;                                     return ans;
}                                               }

Figure 2.18 Two snippets of VSA and ASI implementations in CodeSurfer/x86; EvalVSA/
UpdateVSAState and CollectMemAccesses are other IA32-specific procedures for VSA and ASI,
respectively; ASI makes use of the information from VSA.


has not rallied around a similar common platform). The situation is more serious for low-level
instruction sets, because (i) most instruction sets have evolved over time, so that each
instruction-set family has a bewildering number of variants,¹⁰ which has led to instruction sets
with several hundred instructions, and (ii) there are a variety of architecture-specific features
that are incompatible with other architectures.

    into the low-order 8 bits of rD. Bits 8-15 of the word in memory addressed by EA are
    loaded into the subsequent low-order 8 bits of rD. Bits 16-23 of the word in memory
    addressed by EA are loaded into the subsequent low-order 8 bits of rD. Bits 24-31 of
    the word in memory addressed by EA are loaded into the subsequent low-order 8 bits of
    rD. The high-order 32 bits of rD are cleared.

    The PowerPC architecture cautions programmers that some implementations of the
    architecture may run the lwbrx instructions with greater latency than other types of load
    instructions.

    Other registers altered:

    •  None

Figure 2.19 The description of the PowerPC instruction lwbrx (obtained from the PowerPC
instruction-set manual [27]).

Fig. 2.19 shows an informal description of the operational semantics of an instruction in the
32-bit PowerPC instruction set. One can imagine how expensive and error-prone it would be
to develop an analysis implementation because the developer needs to interpret the instruction’s
concrete semantics in the abstract domain used by the analysis.

¹⁰ For a brief overview, see http://en.wikipedia.org/wiki/{X86,ARM_architecture,PowerPC}. In
particular, the article about ARM lists 25 different architectural versions [Sept. 29, 2008].

Our motivation is to provide a systematic way of extending the analyses used in
CodeSurfer/x86 (and others) to instruction sets other than IA32. The motivation led us to
develop a meta-tool (or tool-generator), called TSL (for “Transformer Specification Language”), to
help in the creation of tools for analyzing machine code. TSL consists of a language for describing
the semantics of an instruction set, along with a run-time system to support the static analysis of
executables written in that instruction set. The work advances the state of the art by creating a
system for automatically generating analysis components from a specification of the language to
be analyzed. In the remaining chapters, we introduce TSL and describe some of its capabilities.

Chapter  3

Transformer  Specification  Language 

In Chapter 2, we discussed the importance and advantages of machine-code analysis and the
challenges in developing a system for analyzing machine code. This chapter presents the TSL system
that we have developed to address the challenging issues discussed in §2.4. “TSL” stands for
“Transformer Specification Language”, and is used both for the name of the overall system and
for the name of the system’s meta-language.

Design  Principles 

In  designing  TSL,  we  were  guided  by  the  following  principles: 

•  There should be a formal language for specifying the semantics of the language to be
analyzed. Moreover, an instruction-set-semantics developer should specify only the abstract
syntax and a concrete operational semantics of the language to be analyzed; each analyzer
should be generated automatically from this specification.

•  Concrete syntactic issues, including (i) decoding (machine code to abstract syntax), (ii)
encoding (abstract syntax to machine code), (iii) parsing assembly (assembly code to abstract
syntax), and (iv) assembly pretty-printing (abstract syntax to assembly code), should be
handled separately from the abstract syntax and concrete semantics.¹

•  There should be a clean interface for analysis developers to specify the abstract semantics
for each analysis. An abstract semantics consists of an interpretation: an abstract domain
and a set of abstract operators (i.e., for the operations of TSL).

•  The abstract semantics for each analysis should be separated from the languages to be
analyzed so that one does not need to specify multiple versions of an abstract semantics for
multiple languages.

¹ The translation of the concrete syntaxes to and from abstract syntax is handled by a generator
tool that is separate from TSL, and will not be discussed in this thesis. The relationship between
the two systems is similar to that between Flex and Bison. With Flex and Bison, a Flex-generated
lexer passes tokens to a Bison-generated parser. In our case, the TSL-defined abstract syntax
serves as the formalism for communicating values, namely instructions’ abstract syntax trees,
between the two tools.

Each of these objectives has been achieved in the TSL system: The TSL system translates the
TSL specification of each instruction set to a common intermediate representation (CIR) that can
be used to create multiple analyzers (§3.1). Each analyzer is specified at the level of the
meta-language (i.e., by reinterpreting the operations of TSL), which, by extension to TSL
expressions and functions, provides the desired reinterpretation of the instructions of an
instruction set (§3.3).

Other  notable  aspects  of  our  work  include 

•  Support  for  Multiple  Analysis  Types.  The  system  supports  several  analysis  types: 

-  Classical  worklist-based  value-propagation  analyses. 

-  Transformer-composition-based  analyses  [74,  169],  which  are  particularly  useful  for 
context-sensitive  interprocedural  analysis,  and  for  relational  analyses. 

-  Unification-based  analyses  for  flow-insensitive  interprocedural  analysis. 

In  addition,  an  emulator  (for  the  concrete  semantics)  is  also  created. 

•  Implemented Analyses. These mechanisms have been instantiated for a number of specific
analyses that are useful for analyzing low-level code, including value-set analysis [38, 41]
(§3.3.1), affine-relation analysis [38, §7.2] (§3.3.2), def-use analysis (for memory, registers,
and flags) (§3.3.3), aggregate structure identification [42] (§3.3.4), and generation of
symbolic expressions for an instruction’s semantics (§3.3.5).

•  Established Applicability. The capabilities of our approach have been demonstrated by
writing specifications for IA32 and PowerPC. These are nearly complete specifications of the
integer subset of these languages, and include such features as (1) aliasing among 8-, 16-, and
32-bit registers, e.g., al, ah, ax, and eax (for IA32), (2) endianness, (3) issues arising due
to bounded-word-size arithmetic (overflow/underflow, carry/borrow, shifting, rotation, etc.),
and (4) setting of condition codes (and their subsequent interpretation at jump instructions).
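
To make features (1), (3), and (4) concrete, the following sketch shows how sub-register views,
wrap-around addition, rotation, and the condition codes CF, OF, ZF, and SF can be computed for
32-bit IA32 values. It is written directly in C++ and is not TSL or TSL-generated code; the helper
names (Low16, Low8, High8, WriteLow8, RotateLeft32, Add32, AddResult) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Sub-register views of a 32-bit register value such as eax.
inline std::uint16_t Low16(std::uint32_t eax) {  // ax: bits 0-15
  return static_cast<std::uint16_t>(eax);
}
inline std::uint8_t Low8(std::uint32_t eax) {    // al: bits 0-7
  return static_cast<std::uint8_t>(eax);
}
inline std::uint8_t High8(std::uint32_t eax) {   // ah: bits 8-15
  return static_cast<std::uint8_t>(eax >> 8);
}

// Writing al replaces bits 0-7 of eax and leaves bits 8-31 unchanged.
inline std::uint32_t WriteLow8(std::uint32_t eax, std::uint8_t al) {
  return (eax & 0xFFFFFF00u) | al;
}

// Left rotation of a 32-bit word by n bit positions (0 <= n < 32).
inline std::uint32_t RotateLeft32(std::uint32_t x, unsigned n) {
  assert(n < 32);
  return n == 0 ? x : (x << n) | (x >> (32 - n));
}

struct AddResult {
  std::uint32_t value;
  bool cf, of, zf, sf;
};

// 32-bit addition with wrap-around and the resulting condition codes.
inline AddResult Add32(std::uint32_t a, std::uint32_t b) {
  const std::uint32_t r = a + b;  // unsigned arithmetic wraps modulo 2^32
  AddResult out;
  out.value = r;
  out.cf = r < a;  // unsigned carry out of bit 31
  // Signed overflow: operands agree in sign, but the result's sign differs.
  out.of = ((~(a ^ b) & (a ^ r)) >> 31) != 0;
  out.zf = r == 0;
  out.sf = (r >> 31) != 0;
  return out;
}
```

For instance, Add32(0x7FFFFFFF, 1) sets OF and SF but not CF, whereas Add32(0xFFFFFFFF, 1) sets CF
and ZF but not OF; an instruction-set specification has to capture distinctions of this kind for
every arithmetic instruction.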

The  TSL-generated  analysis  components  for  IA32  and  PowerPC  have  been  put  together  to  create  a 
system  that  essentially  duplicates  CodeSurfer/x86  [5]  and  creates  CodeSurfer/ppc32,  respectively. 
We  have  also  experimented  with  sufficiently  complex  features  of  other  low-level  languages  (e.g., 
register  windows  for  Sun  SPARC  and  conditional  execution  of  instructions  for  ARM)  to  know  that 
they  fit  our  specification  and  implementation  models. 

The remainder of this chapter is organized as follows: §3.1 presents the overview of the TSL
system both from the perspective of instruction-set specifiers (ISS) (§3.1.1) and that of analysis
developers (§3.1.2). The section also discusses quirky features of several instruction sets, and
discusses how those features are handled in TSL. §3.2 discusses how the TSL compiler generates a
CIR from a TSL specification and how the CIR is used for creating analysis components. The
section also describes how the TSL system handles some important issues, such as recursion and
conditional branches in the CIR. §3.3 presents several analysis components that have been
instantiated for developing a system for analyzing low-level code. §3.4 discusses the measure of
success and the leverage that the TSL system provides. §3.5 discusses related work.

3.1  Overview  of  the  TSL  System 

The key principle of the TSL system is the separation of the semantics of a subject language
from the analysis semantics in the development of an analysis component. As discussed in §1.4.1,
the TSL system is based on semantic reinterpretation, which was originally proposed as a
convenient methodology for formulating abstract interpretations [73, 110, 134, 144, 148].
Semantic reinterpretation involves refactoring the specification of the concrete semantics of a
language into two parts: (i) a client specification, and (ii) a semantic core. The interface to the
core consists of certain basetypes, function types, and operators (sometimes called a semantic
algebra [140]). The client is expressed in terms of this interface. Such an organization permits
the core to be reinterpreted to produce an alternative semantics for the subject language.
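
The client/core split can be sketched in C++ with a template parameter that plays the role of the
semantic core. The example is an assumption-laden illustration and not the TSL implementation: the
two-instruction language, the parity domain, and the names ConcreteInterp, ParityInterp, Instr, and
InterpInstr are all made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Semantic core #1: concrete 32-bit values with wrap-around arithmetic.
struct ConcreteInterp {
  using Value = std::uint32_t;
  static Value Const(std::uint32_t c) { return c; }
  static Value Plus(Value a, Value b) { return a + b; }
  static Value Times(Value a, Value b) { return a * b; }
};

// Semantic core #2: a parity abstract domain; kTop means "any value".
enum class Parity { kEven, kOdd, kTop };

struct ParityInterp {
  using Value = Parity;
  static Value Const(std::uint32_t c) {
    return (c & 1u) ? Parity::kOdd : Parity::kEven;
  }
  static Value Plus(Value a, Value b) {
    if (a == Parity::kTop || b == Parity::kTop) return Parity::kTop;
    return a == b ? Parity::kEven : Parity::kOdd;
  }
  static Value Times(Value a, Value b) {
    if (a == Parity::kEven || b == Parity::kEven) return Parity::kEven;
    if (a == Parity::kTop || b == Parity::kTop) return Parity::kTop;
    return Parity::kOdd;
  }
};

// A made-up two-instruction language operating on a single register.
enum class Op { kAddImm, kMulImm };
struct Instr {
  Op op;
  std::uint32_t imm;
};

// Client specification: written once, in terms of the core's interface.
template <class Interp>
typename Interp::Value InterpInstr(const Instr& i,
                                   typename Interp::Value reg) {
  switch (i.op) {
    case Op::kAddImm: return Interp::Plus(reg, Interp::Const(i.imm));
    case Op::kMulImm: return Interp::Times(reg, Interp::Const(i.imm));
  }
  assert(false && "unknown opcode");
  return reg;
}
```

Instantiating InterpInstr with ConcreteInterp yields an emulator for the toy language, while
instantiating it with ParityInterp yields an abstract transformer for the same instructions;
adding another analysis means writing another core, not another copy of InterpInstr.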

The key insight behind the TSL system is that if a rich enough meta-language is provided for
writing semantic specifications, one can avoid the ad hoc refactoring step. The advantage of this
approach is that it allows the TSL system to act as a “YACC-like” tool for generating analysis
components from a semantic description of an instruction set.

Figure 3.1 The interaction between the TSL system and a client analyzer. The grey boxes
represent TSL-generated analysis components. [Diagram not reproduced; surviving labels:
“Client Analyzer”, “M Instruction-Set Specifications”.]

The TSL system has two classes of users: (1) instruction-set specifiers (ISS) and (2) analysis
developers. The former use the TSL language to specify the concrete semantics of different
instruction sets (the lower part of Fig. 3.1); the latter use semantic reinterpretation to create
new analyses (the upper part of Fig. 3.1). §3.1.1 and §3.1.2 present the TSL system from an
instruction-set specifier’s standpoint and an analysis developer’s standpoint, respectively.

3.1.1  TSL  from  an  ISS’s  Standpoint 

Fig. 3.2 shows part of a specification of the IA32 instruction set taken from the Intel manual
[17]. The specification describes the syntax and the semantics of each instruction only in a
semi-formal way (i.e., a mixture of English and pseudo-code).

Our work is based on completely formal specifications that are written in a language that we
designed (TSL). TSL is a strongly typed, first-order functional language. TSL supports a fixed set
of base-types; a fixed set of arithmetic, bitwise, relational, and logical operators; the ability to
define recursive data-types, map-types, and user-defined functions; and a mechanism for
deconstruction by means of pattern matching.

    General Purpose Registers:
        EAX,EBX,ECX,EDX,ESP,EBP,ESI,EDI,EIP
        Each of these registers also has 16- or 8-bit subset names.
    Addressing Modes: [sreg:][offset][([base][,index][,scale])]
    EFLAGS register: ZF,SF,OF,CF,AF,PF, . . .

    ADD r/m32,r32;  Add r32 to r/m32
    ADD r/m16,r16;  Add r16 to r/m16
    . . .
    Operation: DEST <- DEST + SRC;
    Flags Affected: The OF,SF,ZF,AF,CF, and PF flags are set according to the result.

Figure 3.2 A part of the Intel manual’s specification of IA32’s add instruction.

Basetypes. Fig. A.1 shows the basetypes that TSL provides. There are two categories of primitive
base-types: unparameterized and parameterized. An unparameterized base-type is just a set of
terms. For example, BOOL is a type consisting of truth values, INT32 is a type consisting of
32-bit signed whole numbers, etc. MAP[α, β] is a predefined parameterized type, with parameters
α and β. Each of the following is an instance of the parameterized type MAP:

    MAP[INT32,INT8]
    MAP[INT32,BOOL]
    MAP[INT32,MAP[INT8,BOOL]]

TSL supports arithmetic/logical operators (+, -, *, /, !, &&, ||, xor), bit-manipulation
operators (~, &, |, ^, <<, >>, right-rotate, left-rotate), relational operators (<, <=, >, >=, ==,
!=), and a conditional-expression operator (? :). TSL also provides access/update operators for
map-types. More details of the TSL syntax and semantics can be found in Appendix A.

Specifying an Instruction Set. Fig. 3.4(a) shows a snippet of the TSL specification that
corresponds to Fig. 3.2.² Much of what an instruction-set specifier writes in a TSL specification
is similar to writing an interpreter for an instruction set in first-order ML [99]. One specifies
(i) the abstract-syntax grammar of the instruction set, (ii) a type for concrete states, and (iii)
the concrete semantics of each instruction.

² The TSL specification has been simplified for presentation.

Type        Terms                         Constants
BOOL        false, true                   false, true
INT64       64-bit signed integers        0d64, 1d64, 2d64, ...
INT32       32-bit signed integers        0d32, 1d32, 2d32, ...
INT16       16-bit signed integers        0d16, 1d16, 2d16, ...
INT8        8-bit signed integers         0d8, 1d8, 2d8, ...
STR         Sequences of characters.      ""
            All characters except         "ab...AB...01...!%..."
            ’\000’ permitted.             "\n\r\b\t\f\’\"\\"
                                          "\001\002\003..."
MAP[α,β]    Maps                          no constants

Figure 3.3 Syntax of constants of primitive type.

Reserved, but User-Defined Types and Reserved Functions. Each specification must define
several reserved (but user-defined) types: instruction (lines 2-9 of Fig. 3.4(a)) and State (e.g.,
for 32-bit Intel x86, the type state is a triple of maps; lines 10-12 of Fig. 3.4(a)), as well as
the reserved TSL function interpInstr (lines 17-30 of Fig. 3.4(a)). These reserved types and
functions form part of the API available to analysis engines that use the TSL-generated
transformers (CIR).

The definition of types and constructors on lines 2-9 of Fig. 3.4(a) is an abstract-syntax
grammar for IA32. Type reg consists of nullary constructors for IA32 registers, such as EAX() and
EBX(); flag consists of nullary constructors for the IA32 condition codes, such as ZF() and SF().
Lines 4-6 define types and constructors to represent the various kinds of operands that IA32
supports, i.e., various sizes of immediate, direct register, and indirect memory operands. The
reserved (but user-defined) type instruction consists of user-defined constructors for each
instruction, such as MOV and ADD.

The type State specifies the structure of the execution state. The State for IA32 is defined on
lines 10-12 of Fig. 3.4(a) to consist of three maps, i.e., a memory-map, a register-map, and a flag-map.
The concrete semantics is specified by writing a function named interpInstr (see lines 17-30
(a)
[1]  // User-defined abstract syntax
[2]  reg: EAX() | EBX() | . . . ;
[3]  flag: ZF() | SF() | . . . ;
[4]  operand: Indirect(reg reg INT8 INT32)
[5]         | DirectReg(reg)
[6]         | Immediate(INT32) | . . . ;
[7]  instruction
[8]    : MOV(operand operand)
[9]    | ADD(operand operand) | . . . ;
[10] state: State(MAP[INT32,INT8]    // memory-map
[11]              MAP[reg32,INT32]   // register-map
[12]              MAP[flag,BOOL]);   // flag-map
[13] // User-defined functions
[14] INT32 interpOp(state S, operand op) { . . . };
[15] state updateFlag(state S, . . . ) { . . . };
[16] state updateState(state S, . . . ) { . . . };
[17] state interpInstr(instruction I, state S) {
[18]   with(I) (
[19]     MOV(dstOp, srcOp):
[20]       let srcVal = interpOp(S, srcOp);
[21]       in ( updateState( S, dstOp, srcVal ) ),
[22]     ADD(dstOp, srcOp):
[23]       let dstVal = interpOp(S, dstOp);
[24]           srcVal = interpOp(S, srcOp);
[25]           res = dstVal + srcVal;
[26]           S2 = updateFlag(S, dstVal, srcVal, res);
[27]       in ( updateState( S2, dstOp, res ) ),
[28]     . . .
[29]   );
[30] };


(b)
[1]  template <class BT> class CIR {
[2]    class reg { . . . };
[3]    class EAX : public reg { . . . };
[4]    class flag { . . . };
[5]    class ZF : public flag { . . . };
[6]    class operand { . . . };
[7]    class Indirect : public operand { . . . };
[8]    class instruction { . . . };
[9]    class MOV : public instruction { . . .
[10]     operand op1; operand op2; . . .
[11]   };
[12]   class ADD : public instruction { . . . };
[13]   class state { . . . };
[14]   class State : public state { . . . };
[15]   BT::INT32 interpOp(state S, operand op) { . . . };
[16]   state updateFlag(state S, . . . ) { . . . };
[17]   state updateState(state S, . . . ) { . . . };
[18]   state interpInstr(instruction I, state S) {
[19]     switch(I.id) {
[20]       case ID_MOV: . . .
[21]       case ID_ADD:
[22]         operand dstOp = I.get_child1();
[23]         operand srcOp = I.get_child2();
[24]         BT::INT32 dstVal = interpOp(S, dstOp);
[25]         BT::INT32 srcVal = interpOp(S, srcOp);
[26]         BT::INT32 res = BT::Plus(dstVal, srcVal);
[27]         state S2 = updateFlag(S, dstVal, srcVal, res);
[28]         ans = updateState( S2, dstOp, res );
[29]         break;
[30]       . . . }
[31] }};


Figure 3.4 (a) A part of the TSL specification of IA32 concrete semantics, which corresponds to
the specification of add from the IA32 manual. Reserved types and function names are
underlined. (b) A part of the CIR generated from (a). The CIR is simplified in this presentation.
of Fig. 3.4(a)), which maps an instruction and a state to a state. For instance, the semantics of
ADD is to evaluate the two operands in the input State S and create a return State in which the
target location holds the sum of the two values and the flags hold appropriate flag values.
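
To make the shape of this concrete semantics tangible, the following is a minimal C++ sketch of a state as a triple of maps and of the ADD case. It is an illustration only, not TSL-generated code: the names Reg, Flag, State, and interpAdd are hypothetical, only direct-register operands are handled, and only the ZF and SF flags are updated.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical register and flag names; only a few are modeled.
enum class Reg { EAX, EBX };
enum class Flag { ZF, SF };

// Concrete state as a triple of maps, mirroring lines 10-12 of Fig. 3.4(a).
struct State {
    std::map<std::uint32_t, std::uint8_t> mem;   // memory-map
    std::map<Reg, std::uint32_t> regs;           // register-map
    std::map<Flag, bool> flags;                  // flag-map
};

// Concrete semantics of "ADD dst, src" for direct-register operands only.
// The input state is left untouched; an updated copy is returned.
State interpAdd(const State& S, Reg dst, Reg src) {
    std::uint32_t dstVal = S.regs.at(dst);
    std::uint32_t srcVal = S.regs.at(src);
    std::uint32_t res = dstVal + srcVal;         // wraps modulo 2^32
    State S2 = S;
    S2.flags[Flag::ZF] = (res == 0);             // zero flag
    S2.flags[Flag::SF] = (res >> 31) != 0;       // sign bit of the result
    S2.regs[dst] = res;
    return S2;
}
```

Because the function returns a new State rather than mutating its argument, it matches the functional reading of interpInstr as a map from an instruction and a state to a state.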

3.1.1.1  Case  Study  of  Instruction  Sets 

In  this  section,  we  discuss  the  quirky  characteristics  of  some  instruction  sets,  and  various  ways 
these  can  be  handled  in  TSL. 

IA32. To provide compatibility with 16-bit and 8-bit versions of the instruction set, IA32 provides
overlapping register names, such as AX (the lower 16 bits of EAX), AL (the lower 8 bits of AX),
and AH (the upper 8 bits of AX). There are two possible ways to specify this feature in TSL. One is
to keep three separate maps, for 32-bit registers, 16-bit registers, and 8-bit registers, respectively,
and specify that updates to any one of the maps affect the other two maps. Another is to keep one
32-bit map for registers, and obtain the value of a 16-bit or 8-bit register by masking the value of
the 32-bit register. (The former can yield more precise VSA results.)
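
The second option (one 32-bit register map plus masking) can be sketched in C++ as follows. This is a hedged illustration over concrete values only; the names Reg32, RegMap, readAX, readAL, readAH, and writeAL are hypothetical and do not come from the TSL system.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical 32-bit register names; only EAX and EBX are modeled here.
enum class Reg32 { EAX, EBX };

// A single concrete register map, as in the second option above.
using RegMap = std::map<Reg32, std::uint32_t>;

// AX is the lower 16 bits of EAX.
std::uint16_t readAX(const RegMap& regs) {
    return static_cast<std::uint16_t>(regs.at(Reg32::EAX) & 0xFFFFu);
}

// AL is the lower 8 bits of AX.
std::uint8_t readAL(const RegMap& regs) {
    return static_cast<std::uint8_t>(regs.at(Reg32::EAX) & 0xFFu);
}

// AH is the upper 8 bits of AX (bits 8..15 of EAX).
std::uint8_t readAH(const RegMap& regs) {
    return static_cast<std::uint8_t>((regs.at(Reg32::EAX) >> 8) & 0xFFu);
}

// Writing AL replaces only the low byte of EAX; returns the updated map.
RegMap writeAL(RegMap regs, std::uint8_t v) {
    std::uint32_t old = regs.at(Reg32::EAX);
    regs[Reg32::EAX] = (old & 0xFFFFFF00u) | v;
    return regs;
}
```

For example, with EAX holding 0x12345678, readAX yields 0x5678, readAL yields 0x78, and readAH yields 0x56.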

Another characteristic to note is that IA32 keeps condition codes in a special register, called
EFLAGS.3 One way to specify this feature is to declare “reg32: Eflags()” and make every flag
manipulation fetch the bit value from an appropriate bit position of the value associated with Eflags
in the register-map. Another way is to have symbolic flags, as in our examples, and have every
manipulation of EFLAGS affect the entries in a flag-map for the individual flags.

ARM.  Almost  all  ARM  instructions  contain  a  condition  field  that  allows  an  instruction  to  be 
executed  conditionally,  depending  on  condition-code  flags.  This  feature  reduces  branch  overhead 
and  compensates  for  the  lack  of  a  branch  predictor.  However,  it  may  worsen  the  precision  of  an 
abstract  analysis  because  in  most  instructions’  specifications,  the  abstract  values  from  two  arms  of 
a  TSL  conditional  expression  would  be  joined. 

3Many  other  instruction  sets,  such  as  SPARC,  PowerPC,  and  ARM,  also  use  a  special  register  to  store  condition 
codes. 
[1] MOVEQ(destReg, srcOprnd):
[2]   let cond = flagMap(EQ());
[3]       src = interpOperand(curState, srcOprnd);
[4]       a = regMap[destReg |-> src];
[5]       b = regMap;
[6]       answer = cond ? a : b;
[7]   in ( answer )

Figure  3.5  An  example  of  the  specification  of  an  ARM  conditional-move  instruction  in  TSL. 

For example, MOVEQ is one of ARM’s conditional instructions; if the flag EQ is true when the
instruction starts executing, it executes normally; otherwise, the instruction does nothing. Fig. 3.5
shows the specification of the instruction in TSL. In many abstract semantics, the conditional
expression “cond ? a : b” will be interpreted as a join of the original register map b and the
updated map a, i.e., join(a,b). Consequently, destReg would receive the join of its original value
and src, even when cond is known to have a definite value (TRUE or FALSE) in VSA semantics.
The paired-semantics mechanism presented in §3.2.3 can help with improving the precision of
analyzers by avoiding joins. When the CIR is instantiated with a paired semantics of VSA_INTERP
and DUA_INTERP, and the VSA value of cond is FALSE, the DUA_INTERP value for answer
gets empty def- and use-sets because the true branch a is known to be unreachable according to
the VSA_INTERP value of cond (instead of non-empty sets for defs and uses that contain all the
definitions and uses in destReg and srcOprnd).

SPARC. SPARC uses register windows to reduce the overhead associated with saving registers
to the stack during a conventional function call. Each window has 8 in, 8 out, 8 local, and 8 global
registers. Outs become ins on a context switch, and the new context gets a new set of out and local
registers. A specific platform will have some total number of registers, which are organized as a
circular buffer; when the buffer becomes full, registers are spilled to the stack to free up a sufficient
number for the called procedure. Fig. 3.6 shows a way to accommodate this feature. The syntactic
[1]  var32 : Reg(INT8) | CWP() | . . . ;
[2]  reg32 : OutReg(INT8) | InReg(INT8) | . . . ;
[3]  state: State( . . . , MAP[var32,INT32], . . . );
[4]  INT32 RegAccess(MAP[var32,INT32] regmap, reg32 r) {
[5]    let cwp = regmap(CWP());
[6]        key = with(r) (
[7]          OutReg(i):
[8]            Reg(8+i+(16+cwp*16)%(NWINDOWS*16)),
[9]          InReg(i): Reg(8+i+cwp*16),
[10]         . . . );
[11]   in ( regmap(key) )
[12] }

Figure  3.6  A  method  to  handle  the  SPARC  register  window  in  TSL. 


register (OutReg(n) or InReg(n), defined on line 2) in an instruction is used to obtain a semantic
register (Reg(m), defined on line 1, where m represents the register’s global index), which is the
key used for accesses on and updates to the register map. The desired index of the semantic register
is computed from the index of the syntactic register, the value of CWP (the current window pointer)
from the current state, and the platform-specific value NWINDOWS (lines 8-9).
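
The index arithmetic on lines 8-9 of Fig. 3.6 can be checked with a small C++ sketch. The value NWINDOWS = 8 and the names Kind, SyntacticReg, and semanticIndex are assumptions made for illustration; only the two formulas are taken from the figure.

```cpp
#include <cassert>

// Hypothetical platform parameter: number of register windows.
constexpr int NWINDOWS = 8;

// Syntactic register as it appears in an instruction (index in 0..7).
enum class Kind { Out, In };
struct SyntacticReg { Kind kind; int index; };

// Global index m of the semantic register Reg(m), following lines 8-9 of Fig. 3.6.
int semanticIndex(SyntacticReg r, int cwp) {
    assert(r.index >= 0 && r.index < 8);
    assert(cwp >= 0 && cwp < NWINDOWS);
    switch (r.kind) {
    case Kind::Out:
        return 8 + r.index + (16 + cwp * 16) % (NWINDOWS * 16);
    case Kind::In:
        return 8 + r.index + cwp * 16;
    }
    return -1;  // not reached for valid Kind values
}
```

With these formulas, out register i of window cwp and in register i of window cwp+1 (modulo NWINDOWS) map to the same semantic register, which is how the overlap between adjacent windows is expressed without copying values.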

3.1.1.2  Common  Intermediate  Representation  (CIR) 

Fig.  3.4(b)  shows  part  of  the  common  intermediate  representation  (CIR)  generated  by  the  TSL 
compiler  from  Fig.  3.4(a).4  The  CIR  generated  for  a  given  TSL  specification  is  a  C++  template 
that  can  be  used  to  create  multiple  analysis  components  by  instantiating  the  template  with  different 
semantic  reinterpretations.  Each  generated  CIR  is  specific  to  a  given  instruction-set  specification, 
but  common  (whence  the  name  CIR)  across  generated  analyses. 


4This  CIR  has  been  simplified  for  the  presentation  in  the  thesis. 
Each generated CIR is a template class that takes as input class BT (standing for base-type
interpretation), which is an abstract domain for an analysis (line 1 of Fig. 3.4(b)). The user-defined
abstract syntax (lines 2-9 of Fig. 3.4(a)) is translated to a set of C++ abstract-syntax classes (lines 2-12
of Fig. 3.4(b)). The user-defined types, such as reg, operand, and instruction, are translated
to abstract C++ classes, and the constructors, such as EAX(), Indirect(_,_,_,_), and ADD(_,_), are
subclasses of the appropriate parent abstract C++ classes.

Each  user-defined  function  is  translated  to  a  CIR  function  (lines  15-31  of  Fig.  3.4(b)).  Each 
TSL  basetype  and  basetype-operator  is  prepended  with  the  template  parameter  name  BT;  BT  is 
supplied  by  an  analysis  developer  for  the  analysis  of  interest.  The  with  expression  and  the  pattern 
matching  on  lines  18-22  of  Fig.  3.4(a)  are  translated  into  switch  statements  in  C++  (lines  19-30 
in  Fig.  3.4(b)). 

With-normalization. The TSL front-end performs with-normalization, which transforms all
multi-level with expressions to use only one-level patterns, and then compiles the one-level
patterns via the pattern-compilation algorithm developed by M. Pettersson [153, 178]. The algorithm
for compiling term pattern-matching for functional languages is inspired by finite automata theory.
The algorithm avoids duplicating code and introducing redundant or sub-optimal discrimination
tests by viewing patterns as regular expressions and optimizing the finite automaton that is built to
recognize them.

The  function  calls  for  obtaining  the  values  of  the  two  operands  (lines  23-24  in  Fig.  3.4(a)) 
correspond  to  the  C++  code  on  lines  22-25  in  Fig.  3.4(b).  The  TSL  basetype-operator  +  on  line  25 
in  Fig.  3.4(a)  is  translated  into  a  call  to  BT::Plus,  as  shown  on  line  26  in  Fig.  3.4(b).  The  function 
calls  for  updating  the  State  (lines  26-27  in  Fig.  3.4(a))  are  translated  into  C++  code  (lines  27-28 
in  Fig.  3.4(b)). 

§3.2 presents more details as to how CIR is generated and what kind of facilities CIR provides
for creating analysis components.
3.1.2 TSL from an Analysis Developer’s Standpoint

An analysis developer creates a new analysis component by (i) redefining (in C++) the TSL
basetypes (BOOL, INT32, INT8, etc.), and (ii) redefining (in C++) the primitive operations on
basetypes (+_INT32, +_INT8, etc.). These are used to instantiate the CIR template by passing a
class of basetypes as the template parameter. This implicitly defines an alternative interpretation
of each expression and function in an instruction-set’s concrete semantics (including interpInstr),
and thereby yields an alternative semantics for an instruction set from its concrete semantics.
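
The idea of obtaining several semantics from one template can be sketched in a few lines of C++. The code below is a toy stand-in, not the generated CIR: MiniCIR, interpAdd, ConcInterp, and DuaInterp are hypothetical names, registers are keyed by strings, and flags are omitted. Only the pattern (code written once against BT::INT32 and BT::Plus, then instantiated with different classes) reflects the text.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// A toy stand-in for a generated CIR: the semantics of "add dst, src"
// written once against an interpretation BT (all names are illustrative).
template <class BT>
struct MiniCIR {
    using RegMap = std::map<std::string, typename BT::INT32>;

    static RegMap interpAdd(const RegMap& regs,
                            const std::string& dst,
                            const std::string& src) {
        typename BT::INT32 dstVal = regs.at(dst);
        typename BT::INT32 srcVal = regs.at(src);
        typename BT::INT32 res = BT::Plus(dstVal, srcVal);
        RegMap out = regs;
        out[dst] = res;
        return out;
    }
};

// Concrete interpretation: INT32 is a machine word, Plus wraps modulo 2^32.
struct ConcInterp {
    using INT32 = std::uint32_t;
    static INT32 Plus(INT32 a, INT32 b) { return a + b; }
};

// Def-use-style interpretation: INT32 is a set of used registers,
// and Plus is set union (in the spirit of a def-use analysis).
struct DuaInterp {
    using INT32 = std::set<std::string>;
    static INT32 Plus(const INT32& a, const INT32& b) {
        INT32 r = a;
        r.insert(b.begin(), b.end());
        return r;
    }
};
```

Instantiating MiniCIR<ConcInterp> adds machine words, whereas MiniCIR<DuaInterp>, started from a map in which each register is bound to its own singleton set, binds the destination to the set of registers the result depends on.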

Tab. 3.1 shows the implementations of primitives for three selected analyses: value-set analysis
(VSA, see §3.3.1), def-use analysis (DUA, see §3.3.3), and quantifier-free bit-vector semantics
(QFBV, see §3.3.5). Each interpretation defines an abstract domain. For example, line 3 of each
column defines the abstract-domain class for INT32: ValueSet32, UseSet, and QFBVTerm32. To
define an interpretation, one needs to define 42 basetype operators, most of which have four
variants, for 8-, 16-, 32-, and 64-bit integers, as well as 12 map access/update operations. Each abstract
domain is also required to contain a set of reserved functions, such as join, meet, and widen, which
forms an additional part of the API available to analysis engines that use TSL-generated
transformers (see §3.3).

Usage of TSL-Generated Analysis Components. Fig. 3.7 shows how the CIR is connected to
an analysis solver. The analysis solver in Fig. 3.7 uses classical worklist-based value propagation
in which the TSL-generated transformer interpInstr is invoked with an instruction and the current
State S. On each iteration of the main loop of the solver, changes (new_S) are propagated to
successors/predecessors (depending on propagation direction). §3.3 summarizes three kinds of
analysis engines including worklist-based value propagation.
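
A forward worklist solver of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not the solver used with TSL: the abstract state is a set of possible values of a single register, join is set union, the per-node transformer is a std::function standing in for the generated interpInstr, and no widening is applied, so termination is only guaranteed when the state space reachable on the given graph is finite (for instance, on an acyclic graph). The names State, Transformer, Cfg, solve, and addConst are hypothetical.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <map>
#include <set>
#include <vector>

using State = std::set<int>;                          // toy abstract state
using Transformer = std::function<State(const State&)>;

struct Cfg {
    std::map<int, Transformer> instr;                 // per-node transformer
    std::map<int, std::vector<int>> succs;            // successor edges
};

// Forward worklist propagation: returns the state at entry to each node.
std::map<int, State> solve(const Cfg& g, int entry, const State& init) {
    std::map<int, State> in;
    in[entry] = init;
    std::deque<int> worklist{entry};
    while (!worklist.empty()) {
        int n = worklist.front();
        worklist.pop_front();
        State new_S = g.instr.at(n)(in[n]);           // apply the transformer for n
        auto it = g.succs.find(n);
        if (it == g.succs.end()) continue;            // no successors
        for (int m : it->second) {
            State joined = in[m];
            joined.insert(new_S.begin(), new_S.end());    // join = set union
            if (joined != in[m]) {                    // propagate only on change
                in[m] = joined;
                worklist.push_back(m);
            }
        }
    }
    return in;
}

// Builds a transformer that adds a constant to every possible value.
Transformer addConst(int k) {
    return [k](const State& s) {
        State r;
        for (int v : s) r.insert(v + k);
        return r;
    };
}
```

On a diamond-shaped graph whose two arms add 1 and 2 to a register that is initially 0, the state at the merge node is the join of the two arms, i.e., the set {1, 2}.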

Generated Transformers. Consider the instruction “add ebx, eax”, which causes the sum of
the values of the 32-bit registers ebx and eax to be assigned into ebx. When Fig. 3.4(b) is
instantiated with the three interpretations from Tab. 3.1, lines 17-30 of Fig. 3.4(a) implement the three
transformers that are presented (using mathematical notation) in Tab. 3.2.
Table 3.1 Parts of the declarations of the basetypes, basetype-operators, and map-access/update
functions for three analyses.

      VSA                                 DUA                                 QFBV

[1]   class VSA_INTERP {                  class DUA_INTERP {                  class QFBV_INTERP {
[2]     // basetype                         // basetype                         // basetype
[3]     typedef ValueSet32 INT32;           typedef UseSet INT32;               typedef QFBVTerm32 INT32;
[4]     . . .                               . . .                               . . .
[5]     // basetype-operators               // basetype-operators               // basetype-operators
[6]     INT32 Add(INT32 a, INT32 b) {       INT32 Add(INT32 a, INT32 b) {       INT32 Add(INT32 a, INT32 b) {
[7]       return a.addValueSet(b);            return a.Union(b);                  return QFBVPlus32(a, b);
[8]     }                                   }                                   }
[9]     . . .                               . . .                               . . .
[10]    // map-basetypes                    // map-basetypes                    // map-basetypes
[11]    typedef Dict<reg32,INT32>           typedef Dict<var32,INT32>           typedef QFBVArray
[12]      REGMAP32;                           REGMAP32;                           REGMAP32;
[13]    . . .                               . . .                               . . .
[14]    // map-access/update functions      // map-access/update functions      // map-access/update functions
[15]    INT32 MapAccess(                    INT32 MapAccess(                    INT32 MapAccess(
[16]        REGMAP32 m, reg32 k) {              REGMAP32 m, reg32 k) {              REGMAP32 m, reg32 k) {
[17]      return m.Lookup(k);                 return m.Lookup(k);                 return QFBVArrayAccess(m,k);
[18]    }                                   }                                   }
[19]    REGMAP32                            REGMAP32                            REGMAP32
[20]    MapUpdate( REGMAP32 m,              MapUpdate( REGMAP32 m,              MapUpdate( REGMAP32 m,
[21]        reg32 k, INT32 v) {                 reg32 k, INT32 v) {                 reg32 k, INT32 v) {
[22]      return m.Insert(k, v);              return m.Insert(k,v);               return QFBVArrayUpdate(m,k,v);
[23]    }                                   }                                   }
[24]    . . .                               . . .                               . . .
[25]  };                                  };                                  };

Table  3.2  Transformers  generated  by  the  TSL  system. 


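Analysis    Generated Transformers for “add ebx, eax”

1. VSA      $\lambda S.\, S[\mathrm{ebx} \mapsto S(\mathrm{ebx}) +^{\mathrm{vsa}} S(\mathrm{eax})][\mathrm{ZF} \mapsto (S(\mathrm{ebx}) +^{\mathrm{vsa}} S(\mathrm{eax}) = 0)][\text{more flag updates}]$

2. DUA      $[\, \mathrm{ebx} \mapsto \{\mathrm{eax}, \mathrm{ebx}\},\ \mathrm{ZF} \mapsto \{\mathrm{eax}, \mathrm{ebx}\},\ \ldots \,]$

3. QFBV     $(\mathrm{ebx}' = \mathrm{ebx} +_{32} \mathrm{eax}) \land (\mathrm{ZF}' \Leftrightarrow (\mathrm{ebx} +_{32} \mathrm{eax} = 0)) \land (\mathrm{SF}' \Leftrightarrow (\mathrm{ebx} +_{32} \mathrm{eax} < 0)) \land \ldots$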
[Figure 3.7: diagram. Labels: “N Analysis Components” and “M Instruction-Set Specifications”.
An analysis engine runs the loop “while (worklist ≠ {}) { select an edge n → m from worklist;
new_S = interpInstr#(instr(n), S); . . . }”, which calls the generated transformer interpInstr#
(shown for the ADD case of Fig. 3.4(a)).]

Figure 3.7 How a TSL-generated analysis component (interpInstr#) is invoked in a solver that
uses classical worklist-based value propagation.
3.2 Various Aspects of a Common Intermediate Representation

Given a TSL specification of an instruction set, the TSL system generates a CIR that consists
of two parts: one is a list of C++ classes for the user-defined abstract-syntax grammar; the other
is a list of C++ template functions for the user-defined functions, including the interface function
interpInstr. The C++ functions are generated by linearizing the TSL specification, in evaluation
order, into a series of C++ statements as described in §3.1.1.2.

However, there are some important issues that need to be properly handled for the resulting
code to be usable for creating abstract interpreters for an instruction-set specification. In
particular, the code generated for each transformer must be able to: (i) execute over abstract states
(§3.2.2), (ii) possibly propagate abstract states to more than one successor in a conditional
expression (§3.2.2.1), (iii) compare abstract states and terminate abstract execution when a fixed point is
reached (§3.2.2.2), and (iv) apply widening operators, if necessary, to ensure termination (§3.2.2.2).

In  §3.2.1,  we  discuss  an  additional  issue  that  arises  in  CIR  generation,  which  is  important 
for  avoiding  loss  of  precision  for  some  generated  analyzers.  §3.2.3  presents  the  paired-semantics 
facility  that  the  TSL  system  provides. 

3.2.1  Two-Level  CIR 

The examples given in Fig. 3.4(b), Fig. 3.10, and Fig. 3.11(b) show slightly simplified
versions of CIR code. The TSL system actually generates CIR code in which all the basetypes,
basetype-operators, and access/update functions are qualified with one of two predefined
namespaces that define a two-level interpretation [111, 149]: CONCINTERP for concrete
interpretation (i.e., interpretation in the concrete semantics), and ABSINTERP for abstract interpretation.
Either CONCINTERP or ABSINTERP would replace the occurrences of BT in the example CIR
shown in Fig. 3.4(b), Fig. 3.10, and Fig. 3.11(b).

The  reason  for  using  a  two-level  CIR  is  that  the  specification  of  an  instruction  set  often  contains 
some  manipulations  of  values  that  should  always  be  treated  as  concrete  values.  For  example,  an 
instruction-set  specification  developer  could  follow  the  approach  taken  in  the  PowerPC  manual 
[1]  // User-defined abstract-syntax grammar
[2]  instruction: . . .
[3]    | BCx(BOOL BOOL INT32 BOOL BOOL)
[4]    | . . . ;
[5]  // User-defined functions
[6]  state interpInstr(instruction I, state S) {
[7]    . . .
[8]    BCx(BO, BI, target, AA, LK):
[9]      let . . .
[10]         cia = RegValue32(S, CIA());    // current address
[11]         new_ia = (AA ? target          // direct: BCA/BCLA
[12]                      : cia + target);  // relative: BC/BCL
[13]         lr = RegValue32(S, LR());      // linkage address
[14]         new_lr =
[15]           (LK ? cia + 4                // change the link register: BCL/BCLA
[16]               : lr);                   // do not change the link register: BC/BCA
[17]   . . .
[18] }

Figure 3.8 A fragment of the PowerPC specification for interpreting BCx instructions (BC, BCA,
BCL, BCLA).


[27]  and  specify  variants  of  the  conditional  branch  instruction  (BC,  BCA,  BCL,  BCLA)  of  PowerPC 
by  interpreting  some  of  the  fields  in  the  instruction  (AA  and  LK)  to  determine  which  of  the  four 
variants  is  being  executed  (Fig.  3.8). 

Another reason that this issue arises is that most well-designed instruction sets have many
regularities, and it is convenient to factor the TSL specification to take advantage of these regularities
when specifying the semantics. Such factoring leads to shorter specifications, but leads to the
introduction of auxiliary functions in which one of the parameters holds a constant value for a given
instruction. Fig. 3.9 shows an example of factoring. The IA32 instructions add and sub both have
two operands and can share the code for fetching the values of the two operands. Lines 4-5 are
[1] AddSubInstr(op, dstOp, srcOp): // ADD or SUB
[2]   let dstVal = interpOp(S, dstOp);
[3]       srcVal = interpOp(S, srcOp);
[4]       ans = (op == ADD() ? dstVal + srcVal
[5]                          : dstVal - srcVal); // SUB()
[6]   in ( . . . ),
[7] . . .

Figure  3.9  An  example  of  factoring  in  TSL. 

the  instruction-specific  operations;  the  equality  expression  “op  ==  ADD()”  on  line  4  can  be  (and 
should  be)  interpreted  in  concrete  semantics. 

In both cases, the precision of an abstract transformer can sometimes be improved (and is
never made worse) by interpreting subexpressions associated with the manipulation of concrete
values in concrete semantics. For instance, consider a TSL expression let v = (b ? 1 : 2) that
occurs in a context in which b is definitely a concrete value; v will get a precise value (either 1 or
2) when b is concretely interpreted. However, if b is not expressible precisely in a given abstract
domain, the conditional expression “(b ? 1 : 2)” will be evaluated by joining the two branches, and
v will not hold a precise value. (It will hold the abstraction of {1, 2}.)

Binding-time  analysis.  To  address  the  issue,  we  perform  binding-time  analysis  [109]  on  the  TSL 
code,  the  outcome  of  which  is  that  expressions  associated  with  the  manipulation  of  concrete  values 
in  an  instruction  are  annotated  with  C,  and  others  with  A.  We  then  generate  the  two-level  CIR 
by  appending  CONCINTERP  for  C  values,  and  ABSINTERP  for  A  values.  The  generated  CIR  is 
instantiated  for  an  analysis  transformer  by  defining  ABSINTERP.  The  TSL  translator  supplies  a 
predefined  concrete  interpretation  for  CONCINTERP. 

The instruction-set-specification developer annotates the top-level user-defined (but reserved)
functions, including interpInstr, with binding-time information.

    EXPORT <A> interpInstr(<C>, <A>)
The first argument of interpInstr, of type instruction, is annotated with <C>, which indicates
that all the data extracted from the instruction are treated as concrete; the second argument, of
type State, is annotated with <A>, which indicates that all the data extracted from
the State are treated as abstract. The return type is also annotated as abstract. The binding-time
information <A> is propagated to the call sites of interpInstr.

More  details  of  the  TSL  syntax  for  binding-time  analysis  can  be  found  in  Appendix  A. 

3.2.2  Execution  Over  Abstract  States 

There  are  (at  least)  four  issues  that  arise:  during  the  abstract  interpretation  of  each  transformer, 
the  abstract  interpreter  must  be  able  to  (i)  execute  over  abstract  states,  (ii)  execute  both  branches 
of  a  conditional  expression,  (iii)  compare  abstract  states  and  terminate  abstract  execution  when  a 
fixed  point  is  reached,  and  (iv)  apply  widening  operators,  if  necessary,  to  ensure  termination.  The 
following  subsections  discuss  how  these  issues  are  handled  in  the  translation  to  CIR. 

3.2.2.1  Conditional  Expressions 

Fig. 3.10 shows part of the CIR that corresponds to the TSL expression “let answer = a ? b
: c”. Bool3 is an abstract domain of Booleans, which consists of three values {FALSE, MAYBE,
TRUE}, where MAYBE means “may be FALSE or may be TRUE”. The TSL conditional expression
is translated into three if-statements (lines 3-7, lines 8-12, and lines 13-15 in Fig. 3.10). The
body of the first if-statement is executed when the Bool3 value for a is possibly false (i.e., either
FALSE or MAYBE). Likewise, the body of the second if-statement is executed when the Bool3
value for a is possibly true (i.e., either TRUE or MAYBE). The body of the third if-statement is
executed when the Bool3 value for a is MAYBE. Note that in the body of the third if-statement,
answer is overwritten with the join of t1 and t2 (line 14).

The Bool3 value for the translation of a TSL BOOL-valued expression is fetched by getBool3Value,
which is one of the TSL interface functions that each interpretation is required to define for the
type BOOL. Each analysis developer decides how to handle conditional branches by defining
getBool3Value. It is always sound for getBool3Value to be defined as the constant function that
[1]  BT::BOOL t0 = . . . ; // translation of a
[2]  BT::INT32 t1, t2, answer;
[3]  if(Bool3::possibly_false(t0.getBool3Value())) {
[4]    . . .
[5]    t1 = . . . ; // translation of c
[6]    answer = t1;
[7]  }
[8]  if(Bool3::possibly_true(t0.getBool3Value())) {
[9]    . . .
[10]   t2 = . . . ; // translation of b
[11]   answer = t2;
[12] }
[13] if(t0.getBool3Value() == Bool3::MAYBE) {
[14]   answer = t1.join(t2);
[15] }


Figure 3.10 The translation of the conditional expression “let answer = a ? b : c”.


always  returns  MAYBE.  For  instance,  this  constant  function  is  useful  when  Boolean  values  cannot 
be  expressed  in  an  abstract  domain,  such  as  DUA  for  which  the  abstract  domain  for  BOOL  is  a 
set  of  uses.  For  an  analysis  where  Bool3  is  itself  the  abstract  domain  for  type  BOOL,  such  as 
VSA,  getBool3Value  returns  the  Bool3  value  from  evaluating  the  translation  of  a  so  that  either  an 
appropriate  branch  or  both  branches  can  be  abstractly  executed. 
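
The three-if scheme can be exercised with a small self-contained C++ sketch. It mirrors the structure of Fig. 3.10 but is not the generated code: the interval domain, the enumerator names False/Maybe/True, and the function condExpr are assumptions made for illustration, and the operands b and c are passed in already evaluated, whereas the generated code evaluates each one inside its guarded block.

```cpp
#include <algorithm>
#include <cassert>

// Three-valued Booleans (enumerator names chosen to avoid macro clashes).
enum class Bool3 { False, Maybe, True };

bool possiblyTrue(Bool3 b)  { return b != Bool3::False; }
bool possiblyFalse(Bool3 b) { return b != Bool3::True; }

// Toy abstract domain for INT32: a closed interval [lo, hi].
struct Interval {
    int lo, hi;
    Interval join(const Interval& o) const {
        return {std::min(lo, o.lo), std::max(hi, o.hi)};
    }
    bool operator==(const Interval& o) const { return lo == o.lo && hi == o.hi; }
};

// Abstract evaluation of "answer = a ? b : c" using three if-statements.
Interval condExpr(Bool3 a, const Interval& b, const Interval& c) {
    Interval t1{0, 0}, t2{0, 0}, answer{0, 0};
    if (possiblyFalse(a)) { t1 = c; answer = t1; }       // false branch reachable
    if (possiblyTrue(a))  { t2 = b; answer = t2; }       // true branch reachable
    if (a == Bool3::Maybe) { answer = t1.join(t2); }     // both reachable: join
    return answer;
}
```

An interpretation whose getBool3Value always returns MAYBE corresponds to always calling condExpr with Bool3::Maybe, which yields the join of both branches; a definite TRUE or FALSE selects exactly one branch and loses no precision.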

3.2.2.2  Comparison,  Termination,  and  Widening 

Recursion  is  not  often  used  in  TSL  specifications,  but  is  needed  for  handling  some  instructions 
that  involve  iteration,  such  as  the  IA32  string-manipulation  instructions  (STOS,  LODS,  MOVS,  etc., 
with  various  REP  prefixes),  and  the  PowerPC  multiple-word  load/store  instructions  (LMW,  STMW, 
etc).  For  these  instructions,  the  amount  of  work  performed  is  controlled  either  by  the  value  of  a 
register,  the  value  of  one  or  more  strings,  etc.  These  instructions  can  be  specified  in  TSL  using 
(a)
[1]  state repMovsd(state S, INT32 count) {
[2]    count == 0
[3]    ? S
[4]    : with(S) (
[5]        State(mem, regs, flags):
[6]          let direction = flags(DF());
[7]              edi = regs(EDI());
[8]              esi = regs(ESI());
[9]              src = MemAccess_32_8_LE_32(mem, esi);
[10]             newRegs = direction
[11]               ? regs[EDI()|->edi-4][ESI()|->esi-4]
[12]               : regs[EDI()|->edi+4][ESI()|->esi+4];
[13]             newMem = MemUpdate_32_8_LE_32(
[14]               mem, edi, src);
[15]             newS = State(newMem, newRegs, flags);
[16]         in ( repMovsd(newS, count - 1) )
[17]     )
[18] };

(b)
[1]  state global_S;
[2]  BT::INT32 global_count;
[3]  state global_retval;
[4]  BT::state repMovsd(
[5]      INTERP::state S, BT::INT32 count) {
[6]    global_S = ⊥;
[7]    global_count = ⊥;
[8]    global_retval = ⊥;
[9]    return repMovsdAux(S, count);
[10] };
[11] INTERP::state repMovsdAux(
[12]     INTERP::state S, BT::INT32 count) {
[13]   // Widen and test for convergence
[14]   state tmp_S = global_S ∇ (global_S ⊔ S);
[15]   BT::INT32 tmp_count =
[16]     global_count ∇ (global_count ⊔ count);
[17]   if(tmp_S ⊑ global_S
[18]      && tmp_count ⊑ global_count) {
[19]     return global_retval;
[20]   }
[21]   S = tmp_S; global_S = tmp_S;
[22]   count = tmp_count; global_count = tmp_count;
[23]
[24]   // translation of the body of repMovsd
[25]
[26]   state newS = . . . ;
[27]   state t = repMovsdAux(newS, count - 1);
[28]   global_retval = global_retval ⊔ t;
[29]   return global_retval;
[30] };

Figure 3.11 (a) A recursive TSL function. (b) The translation of the recursive function from (a).
For simplicity, some mathematical notation is used, including ⊔ (join), ∇ (widening), ⊑
(approximation), and ⊥ (bottom).
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recursion.5  For  each  recursive  function  specified  by  an  instruction-set  specification  developer,  the 
TSL  system  generates  a  function  that  appropriately  compares  abstract  values  and  terminates  the 
recursion  if  abstract  values  are  found  to  be  equal  (i.e.,  the  recursion  has  reached  a  fixed  point).  The 
function  is  also  prepared  to  apply  the  widening  operator  that  the  analysis  developer  has  specified 
for  the  abstract  domain  in  use. 

For example, Fig. 3.11(a) shows the user-defined TSL function that handles “rep movsd”, which
copies the contents of one area of memory to a second area.⁶ The amount of memory to be copied
is passed into the function as the argument count. Fig. 3.11(b) shows its translation into the CIR. A
recursive function like repMovsd (Fig. 3.11(a)) is automatically split by the TSL compiler into two
functions, repMovsd (line 4 of Fig. 3.11(b)) and repMovsdAux (line 11 of Fig. 3.11(b)). The TSL
system initializes appropriate global variables global_S and global_count (lines 6-8) in repMovsd,
and then calls repMovsdAux (line 9). At the beginning of repMovsdAux, it generates statements
that widen each of the global variables with respect to the arguments, and test whether all of the
global variables have reached a fixpoint (lines 13-17). If so, repMovsdAux returns global_retval
(line 19). If not, the body of repMovsdAux is analyzed again (lines 24-27). Note that at the
translation of each normal return from repMovsdAux (e.g., line 28), the return value is joined into
global_retval. The TSL system requires each analysis developer to define the functions join and
widen for the basetypes of the interpretation used in the analysis.
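
As an illustration only (this is not TSL-generated code), the widen/test/join skeleton of
Fig. 3.11(b) can be sketched in C++ on a toy example. Everything named below is an assumption
made for the sketch: a hypothetical interval domain (Interval, leq, join, widen) stands in for the
abstract basetypes, and a hypothetical tail-recursive function "f(count) = if count <= 0 then count
else f(count - 1)" stands in for repMovsd, with its branch condition ignored by the abstraction.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

// Hypothetical interval domain standing in for an abstract basetype.
struct Interval {
    long lo, hi;
    bool empty;  // true encodes bottom
};

const long NEG_INF = std::numeric_limits<long>::min();
const long POS_INF = std::numeric_limits<long>::max();

inline Interval bottom() { return Interval{0, 0, true}; }

// Approximation order: every value described by a is also described by b.
inline bool leq(const Interval& a, const Interval& b) {
    if (a.empty) return true;
    if (b.empty) return false;
    return b.lo <= a.lo && a.hi <= b.hi;
}

inline Interval join(const Interval& a, const Interval& b) {
    if (a.empty) return b;
    if (b.empty) return a;
    return Interval{std::min(a.lo, b.lo), std::max(a.hi, b.hi), false};
}

// Standard interval widening: a bound that is still moving jumps to infinity.
inline Interval widen(const Interval& old_v, const Interval& new_v) {
    if (old_v.empty) return new_v;
    if (new_v.empty) return old_v;
    return Interval{new_v.lo < old_v.lo ? NEG_INF : old_v.lo,
                    new_v.hi > old_v.hi ? POS_INF : old_v.hi, false};
}

// Globals playing the roles of global_count and global_retval.
static Interval global_count = bottom();
static Interval global_retval = bottom();

// Analogue of repMovsdAux for the toy function.
Interval countDownAux(Interval count) {
    assert(!count.empty);
    // Widen and test for convergence.
    Interval tmp = widen(global_count, join(global_count, count));
    if (leq(tmp, global_count)) return global_retval;  // fixed point reached
    count = tmp;
    global_count = tmp;

    // Abstract body: the non-recursive branch returns count ...
    global_retval = join(global_retval, count);
    // ... and the recursive branch calls the function on count - 1.
    Interval next = Interval{count.lo == NEG_INF ? NEG_INF : count.lo - 1,
                             count.hi == POS_INF ? POS_INF : count.hi - 1, false};
    Interval t = countDownAux(next);
    global_retval = join(global_retval, t);
    return global_retval;
}

// Analogue of the generated wrapper: initialize the globals, then call the Aux function.
Interval countDown(Interval count) {
    global_count = bottom();
    global_retval = bottom();
    return countDownAux(count);
}
```

Starting from the interval [5, 5], the lower bound is widened to negative infinity on the second
call, and the third call finds that nothing changes and returns, so the abstract recursion stops
after three calls regardless of the concrete value of count.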

⁵ Currently, TSL supports only tail recursion.

⁶ repMovsd is called by interpInstr, which passes in the value of register ecx, and sets ecx to 0
after repMovsd returns.

3.2.3 Paired Semantics

Our system allows easy instantiations of reduced products [74] by means of paired semantics.
The TSL system provides a template for paired semantics as shown in Fig. 3.12(a).

The CIR is instantiated with a paired semantic domain defined with two interpretations,
INTERP1 and INTERP2 (each of which may itself be a paired semantic domain), as shown on
line 1 of Fig. 3.12(b). The communication between interpretations may take place in
basetype-operators or access/update functions; Fig. 3.12(b) is an example of the latter. The two
components of the paired-semantics values are deconstructed on lines 4-7 of Fig. 3.12(b), and the
individual INTERP1 and INTERP2 components from both inputs can be used (as illustrated by the call
to interact on line 8 of Fig. 3.12(b)) to create the paired-semantics return value, answer. Such
overridings of basetype-operators and access/update functions are done by C++ explicit
specialization of members of class templates (this is specified in C++ by “template<>”; see line 2
of Fig. 3.12(b)).

(a)
[1] template <typename INTERP1, typename INTERP2>
[2] class PairedSemantics {
[3]   typedef PairedBaseType<INTERP1::INT32, INTERP2::INT32> INT32;
[4]   ...
[5]   INT32 MemAccess_32_8_LE_32(MEMMAP32_8 mem, INT32 addr) {
[6]     return INT32(INTERP1::MemAccess_32_8_LE_32(mem.GetFirst(), addr.GetFirst()),
[7]                  INTERP2::MemAccess_32_8_LE_32(mem.GetSecond(), addr.GetSecond()));
[8]   }
[9] };

(b)
[1] typedef PairedSemantics<VSA_INTERP, DUA_INTERP> DUA;
[2] template<> DUA::INT32 DUA::MemAccess_32_8_LE_32(
[3]     DUA::MEMMAP32_8 mem, DUA::INT32 addr) {
[4]   DUA::INTERP1::MEMMAP32_8 memory1 = mem.GetFirst();
[5]   DUA::INTERP2::MEMMAP32_8 memory2 = mem.GetSecond();
[6]   DUA::INTERP1::INT32 addr1 = addr.GetFirst();
[7]   DUA::INTERP2::INT32 addr2 = addr.GetSecond();
[8]   DUA::INT32 answer = interact(memory1, memory2, addr1, addr2);
[9]   return answer;
[10] }

Figure 3.12 (a) A part of the template class for paired semantics; (b) an example of C++ explicit
template specialization to create a reduced product.
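
The explicit-specialization mechanism of Fig. 3.12 can be tried out on a much smaller scale. The
following self-contained sketch is not part of the TSL system: ConcInterp and UseInterp are
hypothetical toy interpretations, std::pair stands in for PairedBaseType, and the "interaction"
chosen in the specialized Add (dropping the second operand's uses when the first component shows it
to be 0) is an arbitrary example of one component informing the other.

```cpp
#include <cassert>  // assert (used in the checks shown after the listing)
#include <set>
#include <string>
#include <utility>

// Hypothetical first interpretation: plain integer arithmetic.
struct ConcInterp {
    typedef int INT32;
    static INT32 Add(INT32 a, INT32 b) { return a + b; }
};

// Hypothetical second interpretation: sets of used state components.
struct UseInterp {
    typedef std::set<std::string> INT32;
    static INT32 Add(const INT32& a, const INT32& b) {
        INT32 r = a;
        r.insert(b.begin(), b.end());
        return r;
    }
};

template <typename INTERP1, typename INTERP2>
struct PairedSemantics {
    typedef std::pair<typename INTERP1::INT32, typename INTERP2::INT32> INT32;

    // Default behavior: run both interpretations independently.
    static INT32 Add(const INT32& a, const INT32& b) {
        return INT32(INTERP1::Add(a.first, b.first),
                     INTERP2::Add(a.second, b.second));
    }
};

typedef PairedSemantics<ConcInterp, UseInterp> Paired;

// Explicit specialization of a single member: here the two components interact.
template <>
inline Paired::INT32 PairedSemantics<ConcInterp, UseInterp>::Add(const INT32& a,
                                                                 const INT32& b) {
    int sum = ConcInterp::Add(a.first, b.first);
    // If the first component shows that b is 0, the result depends on a's uses only.
    if (b.first == 0) return INT32(sum, a.second);
    return INT32(sum, UseInterp::Add(a.second, b.second));
}
```

With a = (3, {eax}) and b = (0, {ebx}), Paired::Add(a, b) yields (3, {eax}), whereas the unspecialized
template would have produced (3, {eax, ebx}); any other pairing, such as
PairedSemantics<ConcInterp, ConcInterp>, still uses the default member.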

We also found this method of CIR instantiation to be useful to perform a form of reduced product
when analyses are split into multiple phases, as in a tool like CodeSurfer/x86. CodeSurfer/x86
carries out many analysis phases, and the application of its sequence of basic analysis phases is
itself iterated. On each round, CodeSurfer/x86 applies a sequence of analyses: VSA, DUA, and
several others. VSA is the primary workhorse, and it is often desirable for the information acquired
by VSA to influence the outcomes of other analysis phases by pairing the VSA interpretation with
another interpretation.

[1] with(op) ( . . .
[2]   Indirect32(base, index, scale, disp):
[3]     let addr = base
[4]                + index * SignExtend8To32(scale)
[5]                + disp;
[6]         m = MemUpdate_32_8_LE_32(
[7]               mem, addr, v);
[8]   . . .)

Figure 3.13 A fragment of updateState.

We can use the paired-semantics mechanism to obtain desired multi-phase interactions among
our generated analyzers, typically by pairing the VSA interpretation with another interpretation.
For instance, with DUA_INTERP alone, the information required to obtain abstract memory
location(s) for addr is lost because the DUA basetype-operators (used on + and * on lines 4-5 of
Fig. 3.13) just return the union of the arguments’ use sets. With the pairing of VSA_INTERP
with DUA_INTERP (line 1 of Fig. 3.12(b)), DUA can use the abstract address computed for addr2
(line 7 of Fig. 3.12(b)) by VSA_INTERP, which uses VSA_INTERP::Add and VSA_INTERP::Mult;
the latter operators operate on a numeric abstract domain (rather than a set-based one).

Note  that  during  the  application  of  the  paired  semantics,  VSA  interpretation  will  be  carried 
out  on  the  VSA  component  of  paired  intermediate  values.  In  some  sense,  this  is  duplicated  work; 
however,  a  paired  semantics  is  typically  used  only  in  a  phase  of  transformer  generation  where 
the  transformers  are  generated  during  a  single  pass  over  the  interprocedural  CFG  to  generate  a 
transformer  for  each  instruction.  Thus,  only  a  limited  amount  of  VSA  evaluation  is  performed 
(equal  to  what  would  be  performed  to  check  that  the  VSA  solution  is  a  fixed  point). 
3.3 TSL-Generated Analysis Components

In this section, we present various analyses that are created by the TSL system. As illustrated in
Fig. 3.7, a version of the interface function interpInstr is created for each analysis. Each analysis
engine calls interpInstr at appropriate moments to obtain a transformer for an instruction being
processed. Analysis engines can be categorized as follows:

• Worklist-Based Value Propagation (or Transformer Application) [TA]. These perform classical
worklist-based value propagation in which generated transformers are applied, and
changes are propagated to successors/predecessors (depending on propagation direction).
Context sensitivity in such analyses is supported by means of the call-string approach [169].
VSA uses this kind of analysis engine (§3.3.1).

•  Transformer  Composition  [TC].  These  generally  perform  flow-sensitive,  context-sensitive 
interprocedural  analysis.  DUA  (§3.3.3)  uses  this  kind  of  analysis  engine. 

•  Unification-Based  Analyses  [UB].  These  perform  flow-insensitive  interprocedural  analysis. 
ASI  (§3.3.4)  uses  this  kind  of  analysis  engine. 

For each analysis, the CIR is instantiated with an interpretation by an analysis developer. This
mechanism provides wide flexibility in how one can couple the system to an external package. One
approach, used with VSA, is that the analysis engine (written in C++) calls interpInstr directly.
In this case, the instantiated CIR serves as a transformer evaluator: interpInstr is prepared to
receive an instruction and an abstract state, and return an abstract state. Another approach, used
in DUA, is employed when interfacing to an analysis component that has its own input language
for specifying abstract transformers. In this case, the instantiated CIR serves as a transformer
generator: interpInstr is prepared to receive an instruction and a default abstract state⁷ and return
a transformer specification in the analysis component’s input language.

The following subsections discuss how the CIR is instantiated for various analyses.

⁷ In the case of transformer generation for a TC analyzer, the default state is the identity function.
3.3.1 Creation of a TA Transformer Evaluator for VSA

VSA is a combined numeric-analysis and pointer-analysis algorithm that determines an
over-approximation of the set of numeric values and addresses that each register and memory location
holds at each program point [41]. A memory region is an abstract quantity that represents all
runtime activation records of a procedure. To represent a set of numeric values and addresses, VSA
uses value-sets, where a value-set is a map from memory regions to strided intervals. A strided
interval consists of a lower bound lb, a stride s, and an upper bound lb + ks, and represents the set
of numbers {lb, lb + s, lb + 2s, ..., lb + ks} [160].

The  Interpretation  of  Basetypes  and  Basetype-Operators.  The  abstract  domain  for  the  integer 
basetypes  is  a  value-set.  The  abstract  domain  for  BOOL  is  Bool3  ({FALSE,  MAYBE,  TRUE}), 
where  MAYBE  means  “may  be  FALSE  or  may  be  TRUE”.  The  operators  on  these  domains  are 
described  in  detail  in  [160]. 
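
To make the two basetype domains concrete, the sketch below models a strided interval and a
Bool3-valued comparison in C++. It is a simplified illustration rather than the operators of [160]:
the type and function names are hypothetical, memory regions are omitted (a real value-set maps
each region to a strided interval), 32-bit wraparound is ignored, and the join shown is merely one
sound over-approximation.

```cpp
#include <algorithm>
#include <cassert>

inline long long gcd64(long long a, long long b) {
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b != 0) { long long t = a % b; a = b; b = t; }
    return a;
}

// Strided interval s[lb, ub] with ub = lb + k*s; denotes {lb, lb + s, ..., ub}.
struct StridedInterval {
    long long stride;  // s; 0 for a singleton
    long long lb;      // lower bound
    long long ub;      // upper bound, lb + k*s

    bool contains(long long v) const {
        if (v < lb || v > ub) return false;
        return stride == 0 ? v == lb : (v - lb) % stride == 0;
    }
};

// Join: a strided interval that covers every value of a and of b.
inline StridedInterval join(const StridedInterval& a, const StridedInterval& b) {
    assert(a.lb <= a.ub && b.lb <= b.ub);
    long long s = gcd64(gcd64(a.stride, b.stride), a.lb - b.lb);
    return StridedInterval{s, std::min(a.lb, b.lb), std::max(a.ub, b.ub)};
}

// Three-valued Booleans: Maybe means "may be false or may be true".
enum class Bool3 { False, Maybe, True };

// Abstract "a < b": definite only when the two ranges do not overlap.
inline Bool3 lessThan(const StridedInterval& a, const StridedInterval& b) {
    if (a.ub < b.lb) return Bool3::True;
    if (a.lb >= b.ub) return Bool3::False;
    return Bool3::Maybe;
}
```

For example, joining 4[0, 8] with 4[2, 10] gives 2[0, 10], and comparing those same two strided
intervals with lessThan yields Maybe because their ranges overlap.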

The Interpretation of Map-Basetypes and Access/Update Functions. The abstract domain for
memory maps (MEMMAP32_8, MEMMAP64_8, etc.) is a dictionary that maps each abstract memory
location (i.e., the abstraction of INT32) to a value-set. The abstract domain for register maps
(REGMAP32, REGMAP64, etc.) is a dictionary that maps each variable (reg32, reg64, etc.) to
a value-set. The abstract domain for flag maps (FLAGMAP) is a dictionary that maps a flag to a
Bool3. The access/update functions access or update these dictionaries.

VSA  uses  this  transformer  evaluator  to  create  an  output  abstract  state,  given  an  instruction  and 
an  input  abstract  state.  For  example,  row  1  of  Tab.  3.2  shows  the  generated  VSA  transformer  for 
the  instruction  “add  ebx,  eax”.  The  VSA  evaluator  returns  a  new  abstract  state  in  which  ebx  is 
updated  with  the  sum  of  the  values  of  ebx  and  eax  from  the  input  abstract  state  and  the  flags  are 
updated  appropriately. 

3.3.2  Creation  of  a  TC  Transformer  Generator  for  ARA 

An affine relation is a linear-equality constraint between integer-valued variables. ARA finds
affine relations that hold in the program, for a given set of variables. This analysis is used to find
induction-variable relationships between registers and memory locations; these help in increasing
the precision of VSA when interpreting conditional branches (§3.2.2.1) [38].

The  principle  that  is  used  to  create  a  TC  transformer  generator  is  as  follows:  by  interpreting 
the  TSL  expression  that  defines  the  semantics  of  an  individual  instruction  using  an  abstract  domain 
in which values represent transformers, each call to interpInstr will residuate a transformer for the
instruction.  In  the  case  of  ARA,  the  CIR  is  instantiated  so  that  for  each  instruction,  the  generated 
transformer  operates  on  an  abstract  domain  whose  values  are  sets  of  matrices  that  represent  affine 
transformations  on  registers  and  memory  locations  of  the  state  [141]. 

Interpretation of Basetypes and Basetype-Operators. The abstract domain for the integer
basetypes is a set of linear expressions in which variables are either a register or an abstract memory
location; the actual representation of the domain is a set of columns that consist of an integer
constant and an integer coefficient for each variable. Each column represents an affine expression
over the values that the variables hold at the beginning of the instruction. The basetype operations
are defined so that only a set of linear expressions can be generated; any operation that leads to a
nonlinear expression, such as Times(eax, ebx), returns TOP, which means that no affine relationship
is known to hold.
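
The behavior of such basetype-operators can be sketched as follows. This is a deliberately reduced
model, not the ARA interpretation itself: it tracks a single column rather than a set of columns,
fixes a hypothetical two-variable universe (eax and ebx), ignores modular arithmetic, and uses
hypothetical names (AffineExpr, VarRef, Scale) throughout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

const std::size_t kNumVars = 2;  // hypothetical universe: eax and ebx
enum Var { EAX = 0, EBX = 1 };

// constant + sum_i coeff[i] * var_i, over the values held at instruction entry.
// top == true means that no affine relationship is known.
struct AffineExpr {
    bool top;
    long constant;
    std::array<long, kNumVars> coeff;
};

inline AffineExpr Const(long c) {
    AffineExpr e;
    e.top = false;
    e.constant = c;
    e.coeff.fill(0);
    return e;
}

inline AffineExpr Top() {
    AffineExpr e = Const(0);
    e.top = true;
    return e;
}

inline AffineExpr VarRef(Var v) {
    assert(static_cast<std::size_t>(v) < kNumVars);
    AffineExpr e = Const(0);
    e.coeff[v] = 1;
    return e;
}

inline bool isConstant(const AffineExpr& e) {
    if (e.top) return false;
    for (long c : e.coeff)
        if (c != 0) return false;
    return true;
}

inline AffineExpr Scale(long k, const AffineExpr& e) {
    AffineExpr r = Const(k * e.constant);
    for (std::size_t i = 0; i < kNumVars; ++i) r.coeff[i] = k * e.coeff[i];
    return r;
}

inline AffineExpr Plus(const AffineExpr& a, const AffineExpr& b) {
    if (a.top || b.top) return Top();
    AffineExpr r = Const(a.constant + b.constant);
    for (std::size_t i = 0; i < kNumVars; ++i) r.coeff[i] = a.coeff[i] + b.coeff[i];
    return r;
}

inline AffineExpr Times(const AffineExpr& a, const AffineExpr& b) {
    if (a.top || b.top) return Top();
    if (isConstant(a)) return Scale(a.constant, b);
    if (isConstant(b)) return Scale(b.constant, a);
    return Top();  // the product of two non-constant expressions is not affine
}
```

In this model, Plus(Times(Const(4), VarRef(EAX)), Const(8)) stays affine (4·eax + 8), while
Times(VarRef(EAX), VarRef(EBX)) collapses to TOP, matching the Times(eax, ebx) case in the text.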

Interpretation of Map-Basetypes and Access/Update Functions. The abstract domain of the maps
for ARA is a set of matrices of size (N + 1) × (N + 1), where N is the number of variables.
This abstraction, which is able to find all affine relationships in an affine program, was defined
by Müller-Olm and Seidl [141]. Each access function extracts a set of columns associated with
the variable it takes as an argument, from the set of matrices for its map argument. Each update
function creates a new set of matrices that reflects the affine transformation associated with the
update to the variable in question.

For  each  instruction,  the  ARA  transformer  relates  linear-equality  relationships  that  hold  before 
the  instruction  to  those  that  hold  after  execution  of  the  instruction. 
3.3.3 Def-Use Analysis (DUA)

Def-Use analysis finds the relationships between definitions (defs) and uses of state components
(registers, flags, and memory-locations) for each instruction.

The Interpretation of Basetypes and Basetype-Operators. The abstract domain for the basetypes
is a set of uses (i.e., abstractions of the map-keys in states, such as registers, flags, and abstract
memory locations), and the operators on this domain perform a set union of their arguments’ sets.

The Interpretation of Map-Basetypes and Access/Update Functions. The abstract domains of the
maps for DUA are dictionaries that map each def to a set of uses. Each access function returns the
set of uses associated with the key parameter. Each update function update(D, k, S), where D is
a dictionary, k is one of the state components, and S is a set of uses, returns an updated dictionary
$D[k \mapsto (D(k) \cup S)]$ (or $D[k \mapsto S]$ if a strong update is sound).
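
The two forms of update can be written down directly. The sketch below is an assumption-laden
illustration rather than the DUA implementation: state components are identified by plain strings,
the type names UseSet and DefUseMap are hypothetical, and whether a strong update is sound is
passed in as a flag instead of being decided by the analysis.

```cpp
#include <cassert>  // assert (used in the checks shown after the listing)
#include <map>
#include <set>
#include <string>

typedef std::set<std::string> UseSet;           // a set of uses
typedef std::map<std::string, UseSet> DefUseMap;  // each def mapped to its uses

// access(D, k): the uses recorded for k (empty if k has no entry).
inline UseSet access(const DefUseMap& D, const std::string& k) {
    DefUseMap::const_iterator it = D.find(k);
    return it == D.end() ? UseSet() : it->second;
}

// update(D, k, S): returns a new dictionary and leaves D untouched.
//   weak update:   D[k -> D(k) U S]
//   strong update: D[k -> S]
inline DefUseMap update(DefUseMap D, const std::string& k, const UseSet& S,
                        bool strong) {
    if (strong) {
        D[k] = S;
    } else {
        D[k].insert(S.begin(), S.end());
    }
    return D;
}
```

Starting from a dictionary in which ebx maps to {ecx}, a weak update with {eax} gives ebx the uses
{eax, ecx}, whereas a strong update replaces them with {eax}; in both cases the input dictionary is
left unchanged, in keeping with the side-effect-free style of the CIR.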

The DUA results (e.g., row 2 of Tab. 3.2) are used to create transformers for several additional
analyses, such as GMOD analysis [72], which is an analysis to find modified variables for each
function f (including variables modified by functions transitively called from f), and live-flag
analysis, which is used in our version of VSA to perform trace-splitting/collapsing (see §3.3.5).

3.3.4  Creation  of  a  UB  Transformer  Generator  for  ASI 

ASI is a unification-based, flow-insensitive algorithm to identify the structure of aggregates in
a program [42]. For each instruction, the transformer generator generates a set of ASI commands,
each of which is either a command to split a memory region or a command to unify some portions
of memory (and/or some registers). At analysis time, a client analyzer typically applies the
transformer generator to each of the instructions in the program, and then feeds the resulting set of
ASI commands to an ASI solver to refine the memory regions.

The  Interpretation  of  Basetypes  and  Basetype-Operators.  The  abstract  domain  for  the  basetypes 
is  a  set  of  datarefs,  where  a  dataref  is  an  access  on  specific  bytes  of  a  register  or  memory.  The 
arithmetic,  logical,  and  bit-vector  operations  tag  datarefs  as  non-unifiable  datarefs,  which  means 
that  they  will  only  be  used  to  generate  splits. 
The Interpretation of Map-Basetypes and Access/Update Functions. The abstract domain of the
maps for ASI is a set of splits and unifications. The access functions generate a set of datarefs
associated with a memory location or register. The update functions create a set of unifications or
splits according to the datarefs of the data argument.

For example, for the instruction “mov [ebx],eax”, when ebx holds the abstract address
AR_foo-12, where AR_foo is the memory region for the activation records of procedure foo, the
ASI transformer generator emits one ASI unification command “AR_foo[-12:-9] :=: eax[0:3]”.

3.3.5  Quantifier-Free  Bit-Vector  (QFBV)  Semantics 

QFBV semantics provides a way to obtain a symbolic representation of an instruction’s
semantics, as a formula in first-order quantifier-free bit-vector logic.

The Interpretation of Basetypes and Basetype-Operators. The abstract domain for the integer
basetypes is a set of terms, and each operator constructs a term that represents the operation. The
abstract domain for BOOL is a formula, and each BOOL-valued operator constructs a formula that
represents the operation.

The Interpretation of Map-Basetypes and Access/Update Functions. The abstract domain for the
state components is a dictionary that maps a storage component to a term (or a formula in the case
of FLAGMAP). The access/update functions retrieve from and update the dictionaries, respectively.
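
A minimal sketch of this style of interpretation is given below. It is not the TSL QFBV
interpretation: terms and formulas are represented as strings in SMT-LIB bit-vector syntax purely
for readability, the names (Term, Formula, QfbvState, interpAdd) are hypothetical, and only the
zero flag of an "add"-like instruction is modeled.

```cpp
#include <cassert>
#include <map>
#include <string>

typedef std::string Term;     // a bit-vector term, printed in SMT-LIB syntax
typedef std::string Formula;  // a Boolean formula, printed in SMT-LIB syntax

// Basetype-operators build syntax instead of computing values.
inline Term Var(const std::string& name) { return name; }
inline Term Const32(unsigned v) { return "(_ bv" + std::to_string(v) + " 32)"; }
inline Term Plus(const Term& a, const Term& b) { return "(bvadd " + a + " " + b + ")"; }
inline Formula Eq(const Term& a, const Term& b) { return "(= " + a + " " + b + ")"; }

// State components are dictionaries from storage components to terms or formulas.
struct QfbvState {
    std::map<std::string, Term> regs;
    std::map<std::string, Formula> flags;
};

// Symbolic effect of an "add dst, src"-like instruction (zero flag only).
inline QfbvState interpAdd(QfbvState s, const std::string& dst,
                           const std::string& src) {
    assert(s.regs.count(dst) == 1 && s.regs.count(src) == 1);
    Term sum = Plus(s.regs[dst], s.regs[src]);
    s.regs[dst] = sum;
    s.flags["zf"] = Eq(sum, Const32(0));
    return s;
}
```

Applied to a state in which eax and ebx map to the variables eax and ebx, interpAdd(s, "ebx", "eax")
maps ebx to the term (bvadd ebx eax) and zf to the formula (= (bvadd ebx eax) (_ bv0 32)).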

QFBV  semantics  is  useful  for  a  variety  of  purposes.  One  use  is  as  auxiliary  information  in  an 
abstract  interpreter,  such  as  the  VSA  analysis  engine,  to  provide  more  precise  abstract  interpretation 
of  branches  in  low-level  code.  The  issue  is  that  many  instruction  sets  provide  separate  instructions 
for  (i)  setting  flags  (based  on  some  condition  that  is  tested)  and  (ii)  branching  according  to  the 
values  held  by  flags. 

To address this problem, we use a trace-splitting/collapsing scheme [136]. The VSA analysis
engine partitions the state at each flag-setting instruction based on live-flag information (which is
obtained from an analysis that uses the DUA transformers); a semantic reduction [74] is performed
on the split VSA states with respect to a formula obtained from the transformer generated by the
QFBV semantics. The set of VSA states that result are propagated to appropriate successors at the
branch instruction that uses the flags.

Figure 3.14 An example for trace-splitting.

The cmp instruction (A) in Fig. 3.14, which is a flag-setting instruction, has sf and zf as
live flags because those flags are used at the branch instructions js (B) and jz (E): js and jz jump
according to sf and zf, respectively. After interpretation of (A), the state S is split into four states,
$S_1$, $S_2$, $S_3$, and $S_4$, which are reduced with respect to the formulas $\varphi_1$: (eax - 10 < 0)
associated with sf, and $\varphi_2$: (eax - 10 == 0) associated with zf.

\begin{align*}
S_1 &:= S[\mathit{sf} \mapsto T][\mathit{zf} \mapsto T][\mathit{eax} \mapsto \mathit{reduce}(S(\mathit{eax}),\ \varphi_1 \land \varphi_2)] \\
S_2 &:= S[\mathit{sf} \mapsto T][\mathit{zf} \mapsto F][\mathit{eax} \mapsto \mathit{reduce}(S(\mathit{eax}),\ \varphi_1 \land \lnot\varphi_2)] \\
S_3 &:= S[\mathit{sf} \mapsto F][\mathit{zf} \mapsto T][\mathit{eax} \mapsto \mathit{reduce}(S(\mathit{eax}),\ \lnot\varphi_1 \land \varphi_2)] \\
S_4 &:= S[\mathit{sf} \mapsto F][\mathit{zf} \mapsto F][\mathit{eax} \mapsto \mathit{reduce}(S(\mathit{eax}),\ \lnot\varphi_1 \land \lnot\varphi_2)]
\end{align*}

Because $\varphi_1 \land \varphi_2$ is not satisfiable, $S_1$ becomes $\bot$. State $S_2$ is propagated to the true
branch of js (i.e., just before (C)), and $S_3$ and $S_4$ to the false branch (i.e., just before (D)).
Because no flags are live just before (C), the splitting mechanism maintains just a single state, and
thus all states propagated to (C) (here there is just one) are collapsed to a single abstract state.
Because zf is still live until (E), the states $S_3$ and $S_4$ are maintained as separate abstract states
at (D).
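
The four-way split can be reproduced on a toy domain. The sketch below is only an approximation of
the scheme described above: it replaces value-sets by plain integer intervals for eax, hard-codes
the comparison against 10, ignores arithmetic overflow, and performs the reduction by direct
interval arithmetic instead of consulting a QFBV formula. All names in it are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Interval { long lo, hi; };  // denotes the empty set when lo > hi

struct SplitState {
    bool sf, zf;    // flag values assumed in this partition
    Interval eax;   // value of eax after reduction
    bool bottom;    // true when the partition is infeasible
};

// Reduce eax with respect to sf <=> (eax - 10 < 0) and zf <=> (eax - 10 == 0).
inline SplitState reduceEax(Interval eax, bool sf, bool zf) {
    Interval r = eax;
    if (sf) r.hi = std::min(r.hi, 9L);   // eax < 10
    else    r.lo = std::max(r.lo, 10L);  // eax >= 10
    if (zf) {                            // eax == 10
        r.lo = std::max(r.lo, 10L);
        r.hi = std::min(r.hi, 10L);
    } else {                             // eax != 10: trim 10 if it is an endpoint
        if (r.lo == 10) r.lo = 11;
        if (r.hi == 10) r.hi = 9;
    }
    SplitState out = {sf, zf, r, r.lo > r.hi};
    return out;
}

// Split a state at "cmp eax, 10"; the result is ordered S1, S2, S3, S4.
inline std::vector<SplitState> splitOnCmp10(Interval eax) {
    assert(eax.lo <= eax.hi);
    std::vector<SplitState> states;
    const bool flagValues[2] = {true, false};
    for (bool sf : flagValues)
        for (bool zf : flagValues)
            states.push_back(reduceEax(eax, sf, zf));
    return states;
}
```

For eax in [0, 20], the first partition (sf and zf both true) is infeasible, and the remaining three
carry [0, 9], [10, 10], and [11, 20], respectively, mirroring the treatment of the four states in
the text.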
3.4 Measures of Success

As  an  example  of  the  kind  of  leverage  that  TSL  provides,  the  most  recent  incarnation  of 
CodeSurfer/x86 — a  revised  version  whose  analysis  components  are  implemented  via  TSL — uses 
eight  separate  reinterpretations  generated  from  the  TSL  specification  of  the  IA32  instruction  set.  We 
estimate  that  the  task  of  writing  transformers  (for  the  eight  analysis  phases  used  in  CodeSurfer/x86) 
consumed  about  20  man-months;  in  contrast,  we  have  invested  a  total  of  about  1  man-month  to 
write  the  C++  code  for  the  set  of  TSL  interpretations  that  are  used  to  generate  the  replacement 
components.  To  this,  one  should  add  10-20  man-days  to  write  the  TSL  specification  for  IA32:  the 
current  specification  for  IA32  consists  of  2,834  (non-comment,  non-blank)  lines  of  TSL. 

Because each analysis is defined at the meta-level (i.e., by providing an interpretation for the
collection of TSL primitives), abstract transformers for a given analysis can be created
automatically for each instruction set that is specified in TSL. For instance, from the PowerPC
specification (1,370 non-comment, non-blank lines, which took approximately 4 days to write), we were
immediately able to generate PowerPC-specific versions of all of the analysis components that had
been developed for the IA32 instruction set.

It takes approximately 8 seconds (on an Intel Pentium 4 with a 3.00GHz CPU and 2GB of
memory, running CentOS 4) for the TSL (cross-)compiler to compile the IA32 specification to C++,
followed by approximately 20 minutes wall-clock time (on an Intel Pentium 4 with a 1.73GHz
CPU and 1.5GB of memory, running Windows XP) to compile the generated C++.

It is natural to ask how the TSL-generated analyses perform compared to their hand-coded
counterparts. Due to the nature of the transformers used in one of the analyses that we
implemented (affine-relation analysis (ARA) [141]), it was possible to write an algorithm to compare the
TSL-generated ARA transformers with the hand-coded ARA transformers that were incorporated in
CodeSurfer/x86. On a corpus of 542 instruction instances that covered various opcodes,
addressing modes, and operand sizes, we found that the TSL-generated transformers were equivalent in
324 cases and more precise than the hand-coded transformers in the remaining 218 cases (40%).
For 87 cases, this was because in rethinking how the ARA abstraction could be encoded using TSL
mechanisms, we discovered an easy way to extend [141] to retain some information for 8-, 16-, and
64-bit operations. (In principle, these could have been incorporated into the hand-coded version,
too.)

                            hand-coded ARA transformers    TSL-generated ARA transformers
time (sec)                  0.032                          0.281
total # of memory allocs    4,735                          31,234
max # of memory allocs      20                             682

Figure 3.15 Time (in seconds) and the total/maximum number of memory allocations for getting
TSL-generated ARA transformers and hand-coded transformers.

The  other  131  cases  of  improvement  can  be  ascribed  to  “fatigue  factor”  on  the  part  of  the 
human  programmer:  the  hand-coded  versions  adopted  a  pessimistic  view  and  just  treated  certain 
instructions  as  always  assigning  an  unknown  value  to  the  registers  that  they  affected,  regardless 
of  the  values  of  the  arguments.  Because  the  TSL-generated  transformers  are  based  on  the  ARA 
interpretation’s  definitions  of  the  TSL  basetype-operators,  the  TSL-generated  transformers  were 
more  thorough:  a  basetype-operator’s  definition  in  an  interpretation  is  used  in  all  places  that  the 
operator  arises  in  the  specification  of  the  instruction  set’s  concrete  semantics. 

We measured time and memory consumption to answer the question “how costly is it to use
the TSL-generated analyses?” Fig. 3.15 compares the time (in seconds) and memory consumption
(in number of memory allocations for matrices, which are used in the representation of abstract
elements in the abstract domain for ARA) taken for obtaining 542 TSL-generated ARA transformers
with the time and memory for obtaining the corresponding ARA transformers by the hand-coded
method that was used in the original CodeSurfer/x86. The TSL-based method takes about 8 times
longer than the hand-coded approach and causes about 7 times more memory allocations. In TSL,
all the abstract operations (matrix manipulations) are performed at the meta-level essentially in a
side-effect-free functional environment. Therefore, there can be many unnecessary memory
allocations and object copies at the meta-operator boundaries. However, there is room for improvement
by optimization. Also, TSL still takes less than a second for obtaining 542 ARA transformers. In
light of the performance measurement of ARA, which is the most memory-intensive analysis
we have created using the TSL system, TSL-generated analysis does not cause a significant
performance degradation.
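
For reference, the ratios can be recomputed directly from the entries of Fig. 3.15; the values
below are simple quotients of the tabulated numbers and are of the same order as the rounded
factors quoted in the text (the time ratio comes out somewhat above 8).

```latex
\[
\frac{0.281\ \text{s}}{0.032\ \text{s}} \approx 8.8,
\qquad
\frac{31{,}234}{4{,}735} \approx 6.6,
\qquad
\frac{682}{20} \approx 34 .
\]
```

The last quotient compares the maximum numbers of memory allocations, a ratio that the text does
not quote explicitly.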

We also carried out a study using an algorithm for obtaining the “best transformer”. For a given
instruction I, the TSL QFBV reinterpretation was used to obtain a formula $\varphi_I$ that expresses the
semantics of I. The formula $\varphi_I$ was then used to obtain (a close approximation to) the best ARA
transformer that over-approximates $\varphi_I$, using the techniques described in [116, 161]. About 8.5%
of the ARA transformers generated via the best-transformer algorithm were more precise than the
ARA transformers generated via the TSL-based method. However, there is a trade-off between
precision and speed: the best-transformer method is about 600 times slower (as of May 3, 2011)
than the TSL-based method.

Leverage 

The TSL system provides two dimensions of parameterizability: different instruction sets and
different analyses. Each instruction-set specification developer writes an instruction-set
semantics, and each analysis developer defines an abstract domain for a desired analysis by giving an
interpretation (i.e., the implementations of TSL basetypes, basetype-operators, and access/update
functions). Given the inputs from these two classes of users, the TSL system automatically
generates an analysis component. Note that the work that an analysis developer performs is TSL-specific
but independent of each language to be analyzed; from the interpretation that defines an analysis,
the abstract transformers for that analysis can be generated automatically for every instruction set
for which one has a TSL specification. Thus, to create M × N analysis components, the TSL
system only requires M specifications of the concrete semantics of instruction sets, and N analysis
implementations (Fig. 3.1), i.e., M + N inputs to obtain M × N analysis-component
implementations.

The TSL system provides considerable leverage for implementing analysis tools and
experimenting with new ones. New analyses are easily implemented because a clean interface is provided
for defining an interpretation.
TSL as a Tool Generator. A tool generator (or tool-component generator) such as YACC [107]
takes a declarative description of some desired behavior and automatically generates an
implementation of a component that behaves in the desired way. Often the generated component consists of
generated tables and code, plus some unchanging driver code that is used in each generated tool
component. The advantage of a tool generator is that it creates correct-by-construction
implementations.

For machine-code analysis, the desired components each consist of a suitable abstract
interpretation of the instruction set, together with some kind of analysis driver (a solver for finding the
fixed-point of a set of dataflow equations, a symbolic evaluator for performing symbolic
execution, etc.). TSL is a system that takes a description of the concrete semantics of an instruction set
and a description of an abstract interpretation, and creates an implementation of an abstract
interpreter for the given instruction set.

TSL : concrete semantics × abstract domain → abstract semantics.

In  that  sense,  TSL  is  a  tool  generator  that,  for  a  fixed  instruction-set  semantics,  automatically 
creates  different  abstract  interpreters  for  the  instruction  set. 
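
The idea that one fixed semantics yields different interpreters can be illustrated with C++
templates. The sketch below is not TSL output: the "semantics" is just the effective-address
computation base + index * scale + disp written once against a small operator interface, and the
two interpretations (Concrete and a hypothetical parity domain, ParityInterp) are assumptions
chosen for brevity.

```cpp
#include <cassert>  // assert (used in the checks shown after the listing)

// Interpretation 1: concrete integers.
struct Concrete {
    typedef int INT32;
    static INT32 Plus(INT32 a, INT32 b) { return a + b; }
    static INT32 Times(INT32 a, INT32 b) { return a * b; }
};

// Interpretation 2: a hypothetical parity abstraction.
enum Parity { EVEN, ODD, UNKNOWN };
struct ParityInterp {
    typedef Parity INT32;
    static INT32 Plus(INT32 a, INT32 b) {
        if (a == UNKNOWN || b == UNKNOWN) return UNKNOWN;
        return a == b ? EVEN : ODD;
    }
    static INT32 Times(INT32 a, INT32 b) {
        if (a == EVEN || b == EVEN) return EVEN;  // an even factor forces an even product
        if (a == ODD && b == ODD) return ODD;
        return UNKNOWN;
    }
};

// The semantics, written once; each instantiation is a different interpreter.
template <typename INTERP>
typename INTERP::INT32 effectiveAddress(typename INTERP::INT32 base,
                                        typename INTERP::INT32 index,
                                        typename INTERP::INT32 scale,
                                        typename INTERP::INT32 disp) {
    return INTERP::Plus(INTERP::Plus(base, INTERP::Times(index, scale)), disp);
}
```

Here effectiveAddress<Concrete>(100, 3, 4, 8) evaluates to 120, while
effectiveAddress<ParityInterp>(EVEN, UNKNOWN, EVEN, EVEN) evaluates to EVEN: the same definition
serves as a concrete evaluator and as an abstract one, and adding a further interpretation requires
no change to effectiveAddress.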

The reinterpretation mechanism allows TSL to be used to implement tool-component generators
and tool generators. Each implementation of an analysis component’s driver (e.g.,
fixed-point-finding solver, symbolic executor) serves as the unchanging driver for use in different
instantiations of the analysis component for different instruction sets. The TSL language becomes the
specification language for retargeting that analysis component for different instruction sets:

analyzer generator = abstract-semantics generator + analysis driver.

For tools like CodeSurfer/x86, which incorporates multiple analysis components, we thereby
obtain YACC-like tool generators for such tools:

concrete semantics of L → Tool/L.

Consistency. In addition to leverage and thoroughness, for a system like CodeSurfer/x86,
which uses multiple analysis phases, automating the process of creating abstract transformers
ensures semantic consistency; that is, because analysis implementations are generated from a single
specification of the instruction set’s concrete semantics, this guarantees that a consistent view
of the concrete semantics is adopted by all of the analyses used in the system.

3.5  Related  Work 

In this section, we discuss work from various domains that relates to TSL. §3.5.1 compares the
way we use the technique of reinterpreting TSL’s base-types and meta-operators to the concept of
refactoring as in the original work on semantic reinterpretation [110, 134, 144, 148]. §3.5.2
discusses some instruction-set-description languages developed for various purposes. §3.5.3 presents
various existing systems for creating analyzers and transformers.

3.5.1  Semantic  Reinterpretation 

As discussed in §3.1, semantic reinterpretation involves refactoring the specification of a language’s
concrete semantics into a suitable form by introducing appropriate combinators that are
subsequently redefined to create the different subject-language interpretations.

Semantic Reinterpretation Versus Standard Abstract Interpretation. Semantic reinterpretation
[110, 134, 144, 148] is a form of abstract interpretation [73], but differs from the way abstract
interpretation is normally applied: in standard abstract interpretation, one reinterprets the constructs
of each subject language; in contrast, with semantic reinterpretation one reinterprets the
constructs of the meta-language. Standard abstract interpretation helps in creating semantically
sound tools; semantic reinterpretation helps in creating semantically sound tool generators. In
particular, if you have $N$ subject languages and $M$ analyses, with semantic reinterpretation you
obtain $N \times M$ analyzers by writing just $N + M$ specifications: concrete semantics for $N$ subject
languages and $M$ reinterpretations. With the standard approach, one must write $N \times M$ abstract
semantics.

As originally proposed, semantic reinterpretation permits arbitrary refactoring of a semantic
specification so that the desired outcome can be achieved via reinterpretation of any combinators
introduced. In contrast, in TSL, although it is possible to introduce combinators and refactor
them, the primary mechanism is to reinterpret the base-types and meta-operators of the TSL
meta-language. This approach is particularly convenient for a system that generates multiple analysis
components from a single specification of a language’s concrete semantics.
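
The $N + M$ accounting can be illustrated with a small sketch. The Python below is not TSL: the two
toy "instruction sets", the parity domain, and every name in it (`CONCRETE`, `PARITY`, `SPECS`,
`ANALYZERS`) are assumptions made purely for illustration. Each semantics is written once against a
tiny meta-language (`const`, `add`), and each interpretation supplies one definition of those
operators.

```python
# Hypothetical sketch: N = 2 toy semantics, M = 2 interpretations, N x M transformers.

# M interpretations of the meta-language operators `const` and `add`.
CONCRETE = {
    "const": lambda n: n,
    "add": lambda a, b: a + b,
}
PARITY = {
    "const": lambda n: "even" if n % 2 == 0 else "odd",
    # even+even and odd+odd are even; mixed parities give odd.
    "add": lambda a, b: "even" if a == b else "odd",
}
INTERPRETATIONS = {"concrete": CONCRETE, "parity": PARITY}


# N semantic specifications, each written once in terms of the meta-language.
def spec_isa_a(ops):
    """Toy instruction set A has a single instruction `inc r`: r := r + 1."""
    return lambda r: ops["add"](r, ops["const"](1))


def spec_isa_b(ops):
    """Toy instruction set B has a single instruction `dbl r`: r := r + r."""
    return lambda r: ops["add"](r, r)


SPECS = {"isa_a": spec_isa_a, "isa_b": spec_isa_b}

# Every (specification, interpretation) pair yields a transformer.
ANALYZERS = {
    (isa, name): spec(ops)
    for isa, spec in SPECS.items()
    for name, ops in INTERPRETATIONS.items()
}

print(len(SPECS) + len(INTERPRETATIONS), "specifications ->", len(ANALYZERS), "analyzers")
print(ANALYZERS[("isa_a", "concrete")](3))    # concrete execution of `inc`
print(ANALYZERS[("isa_a", "parity")]("odd"))  # abstract execution of `inc`
```

Adding a third interpretation to `INTERPRETATIONS` would yield two more transformers without
touching either specification, which is the leverage the paragraph above describes.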

Semantic  Reinterpretation  Versus  Translation  to  a  Common  Intermediate  Representation. 

The mapping of subject-language constructs to meta-language operations that one defines as part
of the semantic-reinterpretation approach resembles a translation to a common intermediate representation
(CIR) data structure. Thus, another approach to obtaining “systematic” reinterpretations
that are similar to semantic reinterpretations, in that they apply to multiple subject languages,
would be to translate subject-language programs to a CIR, and then create various interpreters
that implement different abstract interpretations of the node types of the CIR data structure. Each
interpreter would then be applied to (the translation of) programs in any subject language L for
which one has defined an L-to-CIR translator. Compared with interpreting objects of a CIR data
type, the advantages of semantic reinterpretation (i.e., reinterpreting the constructs of the
meta-language) are

1. The presentation of our ideas is simpler because one does not have to introduce an additional
language of trees for representing CIR objects.

2.  With  semantic  reinterpretation,  there  is  no  explicit  CIR  data  structure  to  be  interpreted.  In 
essence,  semantic  reinterpretation  removes  a  level  of  interpretation,  and  hence  generated 
analyzers  should  run  faster. 

Micro-semantics  and  Macro-semantics 

Pleban and Lee proposed the MESS system, a prototype implementation of a compiler generator,
which is based on a semantic-definition style, called high-level semantics [154]. The high-level
semantics was designed to overcome fundamental problems that have precluded the generation of
realistic compilers from traditional denotational specifications. They introduced a separation of
the semantic definition of a programming language into two distinct specifications, called
macro-semantics and micro-semantics. The macro-semantics of a language is defined by a collection of
semantic functions that map syntactic phrases, compositionally, to terms of a semantic algebra.
The micro-semantics specifies the meaning of a semantic algebra.

3.5.2  Instruction-Set-Description  Languages 

There have been many specification languages for instruction sets and many purposes to which
they have been applied. Some were designed for hardware simulation, such as cycle simulation
and pipeline simulation [152, 137]. Others have been used to generate an emulator for
compiler-optimization testing [113, 77]. TDL [113] is a hardware-description language that supports the
retargeting of back-end phases, such as analyses and optimizations relevant to instruction scheduling,
register assignment, and functional-unit binding. The New Jersey machine-code toolkit [158]
addresses concrete syntactic issues (instruction decoding, instruction encoding, etc.). While some
of the existing languages would have been satisfactory for our purposes, their runtime components
were not satisfactory, which necessitated creating our own implementation.

In our work, we needed a mechanism to create abstract interpreters of instruction-set specifications.
There are (at least) four issues that arise: during the abstract interpretation of each
transformer, the abstract interpreter must be able to

•  execute  over  abstract  states, 

•  execute  both  branches  of  a  conditional  expression, 

•  compare  abstract  states  and  terminate  abstract  execution  when  a  fixed  point  is  reached,  and 

•  apply  widening  operators,  if  necessary,  to  ensure  termination. 
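
The four requirements above can be sketched on a toy example. The Python below is a minimal
illustration, not TSL-generated code: the interval domain, the loop `i = 0; while i < 100: i = i + 1`,
and the names `join`, `widen`, `body`, and `analyze` are all assumptions chosen for the sketch. The
transformer runs over abstract states (intervals), follows both branches of the conditional and
joins them, compares successive states to detect a fixed point, and applies widening after a few
iterations to force termination.

```python
import math


def join(a, b):
    """Least upper bound of two intervals (lo, hi)."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def widen(old, new):
    """Standard interval widening: any bound that grew jumps to infinity."""
    lo = old[0] if new[0] >= old[0] else -math.inf
    hi = old[1] if new[1] <= old[1] else math.inf
    return (lo, hi)


def body(interval):
    """Abstract transformer for `if i < 100: i = i + 1`, covering both branches."""
    lo, hi = interval
    results = []
    if lo < 100:   # true branch is feasible: restrict to i <= 99, then add 1
        results.append((lo + 1, min(hi, 99) + 1))
    if hi >= 100:  # false branch is feasible: restrict to i >= 100, i unchanged
        results.append((max(lo, 100), hi))
    out = results[0]
    for r in results[1:]:
        out = join(out, r)
    return out


def analyze(init, widen_after=3):
    """Iterate to a fixed point, switching to widening after `widen_after` steps."""
    state, steps = init, 0
    while True:
        steps += 1
        new = join(state, body(state))
        if steps > widen_after:
            new = widen(state, new)
        if new == state:  # fixed point reached: abstract execution terminates
            return state, steps
        state = new


RESULT, STEPS = analyze((0, 0))
print(RESULT, STEPS)
```

Without the widening step the loop would need about a hundred iterations to stabilize; with it, the
upper bound is extrapolated to infinity, trading precision for fast termination. No narrowing pass is
included in this sketch.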

As far as we know, TSL is the first system with an instruction-set-specification language and support
for such mechanisms.

Although this chapter only discusses the application of TSL to low-level instruction sets, we
believe that only small extensions would be needed to be able to apply TSL to source-code languages
(i.e., to create language-independent analyzers for source-level IRs), as well as bytecode.
The main obstacle is that the concrete semantics of a source-code language generally uses an execution
state based on a stack of variable-to-value (or variable-to-location, location-to-value) maps.
For a low-level language, the state incorporates an address-based memory model, for which the
TSL language provides appropriate primitives.

Functional languages as instruction-set-description languages. Harcourt et al. used ML to
specify the semantics of instruction sets [99]. LISAS [71] is an instruction-set-description language
that was subsequently developed based on their experience using ML. Those two approaches
particularly influenced the design of the TSL language.

A-RTL. TSL shares some of the same goals as A-RTL [157] (i.e., the ability to specify the semantics
of an instruction set and to support multiple clients that make use of a single specification). The
two languages were both influenced by ML, but different choices were made about what aspects of
ML to retain: A-RTL is higher-order, but without datatype constructors and recursion; TSL is
first-order, but supports both datatype constructors and recursion. As discussed in §3.2.2.2, recursion is
not often used in specifications, but is needed for handling some loop-iteration instructions, such as
the IA32 string-manipulation instructions and the PowerPC multiple-word load/store instructions.
The choices made in the design and implementation of TSL were driven by the goal of being able
to define multiple abstract interpretations of an instruction set’s semantics.

Instruction-Set Processor Specifications (ISPS). Siewiorek et al. [170] proposed an operational
hardware specification language, the ISP (Instruction-Set Processor) notation, for describing the
instructions in a processor and how they are implemented, aiming to automate the generation of
software, the evaluation of computer architectures, and the certification of implementations.

They divide a computer system into several levels including the program level, which the ISP
notation is designed to describe. Their design of the ISP notation is based on two principles:
(i) the effect of each instruction can be expressed entirely in terms of the information held in
the current memory (state): the components of the program level are a set of memories and a set
of operations, and the ISP notation is designed for specifying that a given operation of a processor is
performed on a specific data structure that the set of memories holds; and (ii) all the data operations
can be characterized as working on various data-types, where each data-type requires distinct operations
to process its values. A processor can be completely described at the ISP level by
giving its instruction set and its interpreter in terms of its operations, data-types, and memories.
TSL relies on the same principles.

3.5.3  Systems  for  Generating  Analyzers 

Some systems for representing and analyzing programs are (mainly) targeted for a single language.
For instance, SOOT [23] is a powerful and flexible analysis/optimization framework that
supports  analysis  and  transformation  of  Java  bytecode.  One  method  to  support  the  retargeting  of 
analyses  to  different  languages  is  to  create  a  package  that  supports  a  family  of  program  analyses 
that  different  front  ends  can  use  to  create  analysis  components.  Examples  include  BDDBDDB 
[184],  Banshee  [117],  the  Parma  Polyhedra  Library  [21],  WPDS++  [115],  and  WALi  [114].  The 
writer of each client front end needs to encode the semantics of his language by creating appropriate
transformers for each statement and condition in the language’s IR, using the package’s API
(or input language).

WALA  [30]  supports  a  common  intermediate  form  (Common  Abstract  Syntax  Tree),  from 
which  multiple  additional  IRs  (e.g.,  CFGs  and  SSA-form)  can  be  generated,  and  multiple  analyses 
can  be  performed  that  use  these  IRs.  Thus,  this  is  similar  to  the  package  approach,  but  supports  a 
multiplicity  of  analyses. 

In  contrast  to  the  package  approach,  TSL  provides  a  domain-specific  language  for  specifying 
the  semantics  of  instruction  sets.  With  this  approach,  the  ISS  developer  concentrates  on  specifying 
the  concrete  operational  semantics  of  his  language,  using  TSL,  and  a  multiplicity  of  analyzers 
are  then  created  automatically.  Analysis  developers  can  incorporate  different  analysis  packages 
into  the  TSL  framework  by  implementing  appropriate  abstract  operations  that  over-approximate 
the  semantics  of  a  fixed  set  of  TSL  operations  (that  have  a  well-defined  semantics).  (Any  of  the 
aforementioned  packages  could  be  used  for  creating  TSL-based  analyses;  currently,  WALi  is  used 
for all of the TC-style analyzers that have been developed for use with TSL so far.)

There are two analysis systems, TVLA [28] and the optimizer flow-function inference system
developed by Rice et al. [164], in which sound analysis transformers are generated automatically
from a concrete operational semantics, plus a specification of an abstraction (either via the abstraction
function (TVLA) or the concretization function (Rice et al.)). In our system, we rely on the
analysis developer to supply sound abstract operations. While this places an additional burden on
developers, once an analysis is developed it can be used with each instruction set specified in TSL.
Moreover,

•  The  analyses  that  we  support  are  much  more  efficient  than  those  that  can  be  created  with 
TVLA  and  apply  to  our  intended  domain  of  application  (low-level  code). 

•  Some of the analyses that we use, such as ARA [141], appear to be beyond the power of the
heuristics-based transformer-generation methods developed by Rice et al.

Chapter 4

Symbolic  Analysis  via  Semantic  Reinterpretation 

The use of symbolic-reasoning primitives for forward symbolic evaluation, weakest liberal
precondition (WLP), and symbolic composition has experienced a resurgence in program-analysis
tools because of the power that they provide when exploring a program’s state space.

Model-checking tools, such as SLAM [46] and BLAST [102], as well as hybrid concrete/symbolic
program-exploration tools, such as DART [94], CUTE [167], YOGI [98], SAGE
[95], BITSCOPE [54], and DASH [49], use forward symbolic evaluation, WLP, or both. An important
subroutine in these tools is to determine the following: given a path $\pi$ in the program, is $\pi$
feasible (i.e., executable)?

Given path $\pi$, symbolic evaluation is used to construct a path formula $\varphi$ for $\pi$ such that $\pi$ is
feasible if and only if $\varphi$ is satisfiable. Moreover, a model of $\varphi$ can be used to create an input for
the program that causes execution to follow path $\pi$.

Symbolic evaluation is used to create path formulas. To determine whether a path $\pi$ is executable,
an SMT solver is used to determine whether $\pi$’s path formula is satisfiable, and if so,
to generate an input that drives the program down $\pi$. Some of the aforementioned tools also use
WLP to identify new predicates that split part of a program’s state space [46, 49]. Proof-carrying
code systems [145] use WLP to create verification conditions.
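
The feasibility check can be sketched as follows. This Python is a hypothetical illustration, not the
implementation of any tool cited above: the tuple-based expression encoding, the function names, and
the brute-force search over a small range of mathematical integers (standing in for an SMT solver
over 32-bit bit-vectors) are all assumptions. Forward symbolic evaluation keeps a symbolic store that
maps each variable to a term over the inputs, and every `assume` along the path contributes one
conjunct to the path formula.

```python
from itertools import product

OPS = {
    "add": lambda a, b: a + b,
    "lt": lambda a, b: a < b,
    "eq": lambda a, b: a == b,
}


def subst(e, store):
    """Replace program variables in e by their current symbolic values."""
    tag = e[0]
    if tag == "var":
        return store[e[1]]
    if tag in ("const", "in"):
        return e
    return (tag,) + tuple(subst(a, store) for a in e[1:])


def evaluate(e, inputs):
    """Evaluate a term over input symbols under a concrete assignment."""
    tag = e[0]
    if tag == "in":
        return inputs[e[1]]
    if tag == "const":
        return e[1]
    return OPS[tag](evaluate(e[1], inputs), evaluate(e[2], inputs))


def path_formula(path, variables):
    """Forward symbolic evaluation: return the path formula as a list of conjuncts."""
    store = {v: ("in", v) for v in variables}  # each variable starts as its own input
    constraints = []
    for stmt in path:
        if stmt[0] == "assign":
            _, var, rhs = stmt
            store[var] = subst(rhs, store)
        else:  # ("assume", condition): a branch taken along the path
            constraints.append(subst(stmt[1], store))
    return constraints


def find_model(constraints, variables, domain=range(-8, 8)):
    """Brute-force stand-in for an SMT solver: return a satisfying input or None."""
    for values in product(domain, repeat=len(variables)):
        inputs = dict(zip(variables, values))
        if all(evaluate(c, inputs) for c in constraints):
            return inputs
    return None


X, Y = ("var", "x"), ("var", "y")
VARS = ["x", "y"]
# Path 1: y = x + 1; assume(y == 3)   (feasible)
FEASIBLE = [("assign", "y", ("add", X, ("const", 1))), ("assume", ("eq", Y, ("const", 3)))]
# Path 2: y = x + 1; assume(y < x)    (infeasible over mathematical integers)
INFEASIBLE = [("assign", "y", ("add", X, ("const", 1))), ("assume", ("lt", Y, X))]

MODEL = find_model(path_formula(FEASIBLE, VARS), VARS)
NO_MODEL = find_model(path_formula(INFEASIBLE, VARS), VARS)
print(MODEL, NO_MODEL)
```

Because the sketch uses unbounded integers, the second path has no model; under 32-bit
two’s-complement semantics the same path would be feasible at the overflow boundary, which is one
reason exact instruction semantics matter.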

Bug-finding tools, such as ARCHER [187] and SATURN [186], as well as commercial bug-finding
products, such as Coverity’s PREVENT [7] and GrammaTech’s CODESONAR [4], use
symbolic composition. Formulas are used to summarize a portion of the behavior of a procedure.
Suppose that procedure P calls Q at call-site c, and that r is the site in P to which control returns
after the call at c. When c is encountered during the exploration of P, such tools perform the
symbolic composition of the formula that expresses the behavior along the path $[\mathit{entry}_P, \ldots, c]$
explored in P with the formula that captures the behavior of Q to obtain a formula that expresses
the behavior along the path $[\mathit{entry}_P, \ldots, r]$.

Motivation. The standard approach to implementing each of the symbolic-analysis primitives for
a programming language of interest (which we call the subject language) is to create hand-written
translation procedures, one per symbolic-analysis primitive, that convert subject-language commands
into appropriate formulas. Such an approach can be extremely tedious. It is also error prone:
a system can contain subtle inconsistency bugs if the different translation procedures adopt different
“views” of the semantics.

One manifestation of an inconsistency bug would be that if one performs symbolic evaluation
of a path $\pi$ starting from a state that satisfies $\psi = \mathit{WLP}(\pi, \varphi)$, the resulting symbolic state does
not entail $\varphi$. Such bugs undermine the soundness of an analysis tool.

The consistency problem is compounded by the issue of aliasing: most subject languages permit
memory states to have complicated aliasing patterns, but usually it is not obvious that aliasing
is treated consistently across implementations of symbolic evaluation, WLP, and symbolic
composition.

Such bugs are easy to introduce because each translation procedure must encode the subject
language’s semantics; however, the encodings for symbolic evaluation, WLP, and symbolic
composition have different flavors.

Our own interest is in analyzing machine code, such as x86 and PowerPC. Unfortunately, as
discussed in §2.4, machine-code instruction sets have hundreds of instructions, as well as other
complicating factors, such as the use of separate instructions to set flags (based on the condition
that is tested) and to branch according to the flag values, the ability to perform address arithmetic
and dereference computed addresses (hence memory states can have complicated aliasing
patterns), non-aligned memory accesses, etc. To appreciate the need for tool support for creating
symbolic-analysis primitives for real machine-code languages, consult the Intel instruction-set
reference manual ([31, §3.2] and [32, §4.1]), and imagine writing three separate encodings of
each instruction’s semantics to implement symbolic evaluation, WLP, and symbolic composition.
Some tools (e.g., [54, 95]) need an instruction-set emulator, in which case a fourth encoding of the
semantics is also required.

Our approach. To address these issues, this chapter presents a way to automatically obtain
mutually-consistent, correct-by-construction implementations of symbolic primitives, by generating
them from a specification of the subject language’s concrete semantics.

The semantics of the basic symbolic-reasoning primitives are easy to state; for instance, if
$\tau(\sigma, \sigma')$ is a 2-state formula that represents the semantics of an instruction, then $\mathit{WLP}(\tau, \varphi)$ can
be expressed as $\forall \sigma'.(\tau(\sigma, \sigma') \Rightarrow \varphi(\sigma'))$. However, this formula uses quantification over states
(i.e., second-order quantification), whereas SMT solvers, such as Yices [82] and Z3 [78], support
only quantifier-free first-order logic. Hence, such a formula cannot be used directly.

For a simple language that has only int-valued variables, it is easy to recast matters in
first-order logic. For instance, the WLP of postcondition $\varphi$ with respect to an assignment statement
var = rhs; can be obtained by substituting rhs for all (free) occurrences of var in $\varphi$: $\varphi[\mathit{var} \leftarrow \mathit{rhs}]$.
For real-world programming languages, however, the situation is more complicated. For instance,
for languages with pointers, Morris’s rule of substitution [138] requires taking into account all possible
aliasing combinations. In general, tool builders need to create implementations of symbolic
primitives for full languages, and hence must be prepared to accommodate whatever features the
language supports.
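
For the simple int-variable case, the substitution rule can be sketched directly. The Python below is
an illustration under stated assumptions: the tuple-based term encoding and the names `substitute`,
`wlp_assign`, `execute_assign`, and `consistent` are hypothetical, and the check is an exhaustive
comparison on a small range of values rather than a proof. It computes $\varphi[\mathit{var} \leftarrow \mathit{rhs}]$ and confirms
that the precondition holds before the assignment exactly when the postcondition holds after it. It
does not handle pointers, which is precisely the case where Morris’s rule is needed.

```python
from itertools import product

OPS = {
    "xor": lambda a, b: a ^ b,
    "add": lambda a, b: a + b,
    "eq": lambda a, b: a == b,
    "lt": lambda a, b: a < b,
}


def substitute(e, var, rhs):
    """Return e[var <- rhs]: replace every occurrence of variable `var` by `rhs`."""
    tag = e[0]
    if tag == "var":
        return rhs if e[1] == var else e
    if tag == "const":
        return e
    return (tag,) + tuple(substitute(a, var, rhs) for a in e[1:])


def evaluate(e, state):
    """Evaluate a term or formula in a concrete state (a dict from names to ints)."""
    tag = e[0]
    if tag == "var":
        return state[e[1]]
    if tag == "const":
        return e[1]
    return OPS[tag](evaluate(e[1], state), evaluate(e[2], state))


def wlp_assign(var, rhs, post):
    """WLP of `post` with respect to the assignment `var = rhs;`."""
    return substitute(post, var, rhs)


def execute_assign(var, rhs, state):
    """Concrete semantics of `var = rhs;`."""
    new_state = dict(state)
    new_state[var] = evaluate(rhs, state)
    return new_state


X, Y = ("var", "x"), ("var", "y")
RHS = ("xor", X, Y)               # statement: x = x xor y;
POST = ("eq", X, ("const", 5))    # postcondition: x == 5
PRE = wlp_assign("x", RHS, POST)  # precondition: (x xor y) == 5


def consistent(domain=range(-4, 4)):
    """PRE holds before the assignment iff POST holds after it, for every sampled state."""
    return all(
        evaluate(PRE, {"x": x, "y": y})
        == evaluate(POST, execute_assign("x", RHS, {"x": x, "y": y}))
        for x, y in product(domain, repeat=2)
    )


print(PRE, consistent())
```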

We present a method to obtain quantifier-free, first-order-logic formulas for (a) symbolic evaluation
of a single command, (b) WLP with respect to a single command, and (c) symbolic composition
for a class of formulas that express state transformations. The generated implementations
are guaranteed to be mutually consistent, and also to be consistent with an instruction-set emulator
(for concrete execution) that is generated from the same specification of the subject language’s
concrete semantics.

Primitives (a) and (b) immediately extend to compound operations over a given program path
for use in forward and backwards symbolic evaluation, respectively; see §4.5. (The design of client
algorithms that use such primitives to perform state-space exploration is an orthogonal issue that
is outside the scope of this chapter.)

Achievements and Contributions. We used the approach described in this chapter to create a
“YACC-like” tool for generating mutually-consistent, correct-by-construction implementations of
symbolic-analysis primitives for instruction sets (§4.7). The input is a specification of an instruction
set’s concrete semantics; the output is a triple of C++ functions that implement the three
symbolic-analysis primitives: (1) translation of an instruction into a formula, (2) WLP with respect
to an instruction, and (3) symbolic composition. The tool has been used to generate such
primitives for x86 and PowerPC. To accomplish this, we leveraged TSL as the implementation
platform for defining the necessary reinterpretations.

The contributions of the work described in this chapter lie in the insights that went into defining
the specific reinterpretations that we use to obtain mutually-consistent, correct-by-construction
implementations of the symbolic-analysis primitives, and the discovery that WLP could be obtained
by using two different reinterpretations working in tandem. The chapter’s other contributions are
summarized as follows:

•  We present a new application for semantic reinterpretation (§4.1), namely, to create implementations
of the basic primitives for symbolic reasoning (§4.3 and 4.4). In particular, two
key insights allowed us to obtain the primitives for WLP and symbolic composition:

-  The  first  insight  was  that  we  could  apply  semantic  reinterpretation  in  a  new  context, 
namely,  to  the  interpretation  function  of  a  logic  (§4.3). 

-  The second insight was to define a particular form of state-transformation formula,
called a structure-update expression (see §4.2.1), to be a first-class notion in the logic,
which allows such formulas (i) to serve as a replacement domain in various reinterpretations,
and (ii) to be reinterpreted themselves (§4.3).

•  We show how reinterpretation can automatically create a WLP primitive that implements
Morris’s rule of substitution for a language with pointers [138] (§4.3).

•  We conducted an experiment that used the generated symbolic-evaluation primitive on real
x86 code. The experiment showed that using an exact symbolic-evaluation primitive, as
opposed to one that approximates the real semantics, is slower by a factor of 1.07 but is
dramatically more accurate (§4.7).

Moreover,  we  demonstrate  that  this  approach  to  creating  symbolic-analysis  primitives  can  handle 
languages  with  pointers  and  address  arithmetic  (§4.3  and  4.4).  For  expository  purposes,  simplified 
languages  are  used  throughout.  Our  discussion  of  machine  code  (§4.2.3  and  4.4)  is  based  on  a 
greatly  simplified  fragment  of  the  x86  instruction  set;  however,  our  implementation  (§4.7)  works 
on  code  from  real  x86  programs  compiled  from  C++  source  code,  including  C++  STL,  using  Visual 
Studio. 

Organization. The remainder of this chapter is organized as follows: §4.2 defines the logic that
we use, as well as a simple source-code language (PL) and an idealized machine-code language (MC).
§4.3  discusses  how  to  use  reinterpretation  to  obtain  the  three  symbolic-analysis  primitives  for  PL. 
§4.4  addresses  reinterpretation  for  MC.  §4.5  explains  how  other  language  constructs  beyond  those 
found  in  PL  and  MC  can  be  handled.  §4.6  describes  how  non-determinism  can  be  incorporated 
into  our  approach.  §4.7  describes  how  we  used  the  TSL  system  for  the  implementation,  and  also 
presents  the  experiment  carried  out  with  the  implementation.  §4.8  discusses  related  work.  §4.9 
presents  some  conclusions.  Correctness  proofs  can  be  found  in  Appendix  B. 

4.1  Semantic  Reinterpretation 

This  section  presents  the  basic  principles  of  semantic  reinterpretation  in  the  context  of  abstract 
interpretation.  We  use  a  simple  language  of  assignments,  and  define  the  concrete  semantics  and  an 
abstract  sign-analysis  semantics  via  semantic  reinterpretation. 

(a)  s1: x = x ⊕ y;
     s2: y = x ⊕ y;
     s3: x = x ⊕ y;

(b)  t1: *px = *px ⊕ *py;
     t2: *py = *px ⊕ *py;
     t3: *px = *px ⊕ *py;

(c)  [diagram of memory configurations before and after the swap]

(d)  [1] mov eax, [ebp-10]
     [2] xor eax, [ebp-14]
     [3] mov [ebp-10], eax
     [4] mov eax, [ebp-10]
     [5] xor eax, [ebp-14]
     [6] mov [ebp-14], eax
     [7] mov eax, [ebp-10]
     [8] xor eax, [ebp-14]
     [9] mov [ebp-10], eax

Figure 4.1 (a) Code fragment that swaps two ints; (b) code fragment that swaps two ints using
pointers; (c) possible before and after configurations for code fragment (b): the swap is
unsuccessful due to aliasing; (d) x86 machine code (in Intel syntax) corresponding to (a).

Example 4.1 [Adapted from [134].] Consider the following fragment of a denotational semantics,
which defines the meaning of assignment statements over variables that hold signed 32-bit int
values (where ⊕ denotes exclusive-or):

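$$
\begin{array}{l}
I \in \mathit{Id} \qquad E \in \mathit{Expr} ::= I \mid E_1 \oplus E_2 \mid \ldots \\
S \in \mathit{Stmt} ::= I = E; \qquad \sigma \in \mathit{State} = \mathit{Id} \to \mathit{Int32} \\[4pt]
\mathcal{E} : \mathit{Expr} \to \mathit{State} \to \mathit{Int32} \\
\mathcal{E}\llbracket I \rrbracket \sigma = \sigma I \\
\mathcal{E}\llbracket E_1 \oplus E_2 \rrbracket \sigma = \mathcal{E}\llbracket E_1 \rrbracket \sigma \oplus \mathcal{E}\llbracket E_2 \rrbracket \sigma \\[4pt]
\mathcal{I} : \mathit{Stmt} \to \mathit{State} \to \mathit{State} \\
\mathcal{I}\llbracket I = E; \rrbracket \sigma = \sigma[I \mapsto \mathcal{E}\llbracket E \rrbracket \sigma]
\end{array}
$$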

We use the notation “$\sigma[I \mapsto v]$” to mean the State that acts like $\sigma$ except that argument $I$ is mapped
to $v$. The function $\mathcal{I}$ can be understood as an interpreter for the language: $(\mathcal{I}\llbracket s \rrbracket \sigma)$ is the state that
results from executing statement $s$ on the state $\sigma$. A sequence of statements can be executed by
repeatedly calling $\mathcal{I}$. For instance, consider the program shown in Fig. 4.1(a), which swaps two
ints. Execution of this code, starting from the state $\sigma_0 = \{x \mapsto -1, y \mapsto 2\}$, can be achieved as
follows:

$$
\begin{array}{lcl}
\sigma_0 & := & \{x \mapsto -1,\ y \mapsto 2\} \\
\sigma_1 & := & \mathcal{I}\llbracket s_1 : x = x \oplus y; \rrbracket \sigma_0 = \{x \mapsto -3,\ y \mapsto 2\} \\
\sigma_2 & := & \mathcal{I}\llbracket s_2 : y = x \oplus y; \rrbracket \sigma_1 = \{x \mapsto -3,\ y \mapsto -1\} \\
\sigma_3 & := & \mathcal{I}\llbracket s_3 : x = x \oplus y; \rrbracket \sigma_2 = \{x \mapsto 2,\ y \mapsto -1\}
\end{array}
$$
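
The execution trace above can be reproduced with a direct transcription of the two semantic
functions. The Python below is a sketch under assumptions: expressions are encoded as tuples,
states as dicts, and the helper `to_int32` (a name introduced here) wraps results to signed 32-bit
values; `E` and `I` mirror $\mathcal{E}$ and $\mathcal{I}$.

```python
def to_int32(n):
    """Wrap a Python int to a signed 32-bit two's-complement value."""
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n & 0x80000000 else n


def E(expr, sigma):
    """Meaning of an expression: ("id", name) or ("xor", e1, e2)."""
    if expr[0] == "id":
        return sigma[expr[1]]              # map access
    _, e1, e2 = expr
    return to_int32(E(e1, sigma) ^ E(e2, sigma))


def I(stmt, sigma):
    """Meaning of the statement `ident = expr;`: a state-to-state function."""
    ident, expr = stmt
    updated = dict(sigma)                  # sigma itself is left unchanged
    updated[ident] = E(expr, sigma)        # map update
    return updated


X, Y = ("id", "x"), ("id", "y")
XOR_XY = ("xor", X, Y)
# s1: x = x xor y;  s2: y = x xor y;  s3: x = x xor y;
SWAP = [("x", XOR_XY), ("y", XOR_XY), ("x", XOR_XY)]


def run(program, sigma):
    """Execute a statement sequence by repeatedly calling I, keeping every state."""
    trace = [sigma]
    for stmt in program:
        trace.append(I(stmt, trace[-1]))
    return trace


TRACE = run(SWAP, {"x": -1, "y": 2})
for index, state in enumerate(TRACE):
    print(f"sigma_{index} = {state}")
```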

The languages derivable from Expr and State define the subject language. The semantics is
defined using a meta-language. In this example, the meta-language has one base type ($\mathit{Int32}$). It
supports defining map types ($\mathit{State} = \mathit{Id} \to \mathit{Int32}$) and user-defined functions ($\mathcal{E}$ and $\mathcal{I}$). It also
supports operations on base-type values (e.g., $\oplus$), map-access operations ($\sigma I$), map-update
operations ($\sigma[I \mapsto \mathcal{E}\llbracket E \rrbracket \sigma]$), and invocation of user-defined functions ($\mathcal{E}\llbracket E \rrbracket \sigma$).

To  highlight  better  the  role  of  the  meta-language,  we  introduce  names  for  certain  aspects  of 
the  meta-language.  For  instance,  the  one  base  type,  whose  standard  interpretation  is  Int32,  will  be 
called  Val.  We  also  introduce  names  for  the  following  operators: 

•  xor, whose standard interpretation is $\oplus$.

•  lookup, for map-access operations.

•  store, for map-update operations.

The  specification  given  earlier  is  thus  rewritten  as  follows: 

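$$
\begin{array}{l}
\mathit{xor} : \mathit{Val} \to \mathit{Val} \to \mathit{Val} \\
\mathit{lookup} : \mathit{State} \to \mathit{Id} \to \mathit{Val} \\
\mathit{store} : \mathit{State} \to \mathit{Id} \to \mathit{Val} \to \mathit{State} \\[4pt]
\mathcal{E} : \mathit{Expr} \to \mathit{State} \to \mathit{Val} \\
\mathcal{E}\llbracket I \rrbracket \sigma = \mathit{lookup}\ \sigma\ I \\
\mathcal{E}\llbracket E_1 \oplus E_2 \rrbracket \sigma = \mathcal{E}\llbracket E_1 \rrbracket \sigma\ \mathit{xor}\ \mathcal{E}\llbracket E_2 \rrbracket \sigma \\[4pt]
\mathcal{I} : \mathit{Stmt} \to \mathit{State} \to \mathit{State} \\
\mathcal{I}\llbracket I = E; \rrbracket \sigma = \mathit{store}\ \sigma\ I\ \mathcal{E}\llbracket E \rrbracket \sigma
\end{array}
$$

For the concrete (or “standard”) semantics, the meta-language types and operators are defined as
follows:

$$
\begin{array}{rcl}
v \in \mathit{Val}_{std} & = & \mathit{Int32} \\
\mathit{State}_{std} & = & \mathit{Id} \to \mathit{Val} \\
\mathit{lookup}_{std} & = & \lambda \sigma. \lambda I. \sigma I \\
\mathit{store}_{std} & = & \lambda \sigma. \lambda I. \lambda v. \sigma[I \mapsto v] \\
\mathit{xor}_{std} & = & \lambda v_1. \lambda v_2. v_1 \oplus v_2
\end{array}
$$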


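Different abstract interpretations of the same language can be defined by using the same semantic
specification, but by giving different interpretations of the base types, function types, and operators
of the meta-language. For example, for sign analysis, assuming that Int32 values are represented
in two’s complement, the meta-language is reinterpreted as follows:¹

$$
\begin{array}{rcl}
v \in \mathit{Val}_{abs} & = & \{\mathit{neg}, \mathit{zero}, \mathit{pos}, \top\} \\
\mathit{State}_{abs} & = & \mathit{Id} \to \mathit{Val}_{abs} \\
\mathit{lookup}_{abs} & = & \lambda \sigma. \lambda I. \sigma I \\
\mathit{store}_{abs} & = & \lambda \sigma. \lambda I. \lambda v. \sigma[I \mapsto v] \\[4pt]
\mathit{xor}_{abs} & = &
\begin{array}{c|cccc}
v_1 \backslash v_2 & \mathit{neg} & \mathit{zero} & \mathit{pos} & \top \\ \hline
\mathit{neg} & \top & \mathit{neg} & \mathit{neg} & \top \\
\mathit{zero} & \mathit{neg} & \mathit{zero} & \mathit{pos} & \top \\
\mathit{pos} & \mathit{neg} & \mathit{pos} & \top & \top \\
\top & \top & \top & \top & \top
\end{array}
\end{array}
$$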

Essentially,  this  redefines  (or  abstracts)  the  set  of  values  Valstd  to  Valabs  and  redefines  the  operators 
(like  xor )  to  operate  on  the  abstract  values. 

For the code fragment shown in Fig. 4.1(a), sign-analysis reinterpretation creates abstract transformers
that, given the initial abstract state $\sigma_0 = \{x \mapsto \mathit{neg}, y \mapsto \mathit{pos}\}$, produce the abstract states
shown in Fig. 4.2.  □


1 For the two's-complement representation, $pos\ xor_{abs}\ neg = neg\ xor_{abs}\ pos = neg$ because, for all combinations of values represented by $pos$ and $neg$, the high-order bit of the result is set, which means that the result is always negative. However, $pos\ xor_{abs}\ pos = neg\ xor_{abs}\ neg = \top$ because the concrete result could be either 0 or positive, and $zero \sqcup pos = \top$.

$\sigma_0 := \{x \mapsto neg, y \mapsto pos\}$

$\sigma_1 := \mathcal{I}[\![s_1: x = x \oplus y;]\!]\sigma_0 = store_{abs}\ \sigma_0\ x\ (neg\ xor_{abs}\ pos) = \{x \mapsto neg, y \mapsto pos\}$

$\sigma_2 := \mathcal{I}[\![s_2: y = x \oplus y;]\!]\sigma_1 = store_{abs}\ \sigma_1\ y\ (neg\ xor_{abs}\ pos) = \{x \mapsto neg, y \mapsto neg\}$

$\sigma_3 := \mathcal{I}[\![s_3: x = x \oplus y;]\!]\sigma_2 = store_{abs}\ \sigma_2\ x\ (neg\ xor_{abs}\ neg) = \{x \mapsto \top, y \mapsto neg\}$

Figure 4.2 Application of the abstract transformers created by the sign-analysis reinterpretation to the initial abstract state $\sigma_0 = \{x \mapsto neg, y \mapsto pos\}$.
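The sign-analysis reinterpretation can be sketched in a few lines of Python. This is only an illustration, not the TSL-generated implementation: the dictionary encoding of abstract states and the helper name `xor_assign` are assumptions made here, while the entries of `XOR_ABS` are transcribed from the $xor_{abs}$ table.

```python
NEG, ZERO, POS, TOP = "neg", "zero", "pos", "T"

# XOR_ABS[v1][v2], transcribed from the xor_abs table (two's-complement Int32).
XOR_ABS = {
    NEG:  {NEG: TOP, ZERO: NEG,  POS: NEG, TOP: TOP},
    ZERO: {NEG: NEG, ZERO: ZERO, POS: POS, TOP: TOP},
    POS:  {NEG: NEG, ZERO: POS,  POS: TOP, TOP: TOP},
    TOP:  {NEG: TOP, ZERO: TOP,  POS: TOP, TOP: TOP},
}

def lookup_abs(sigma, ident):
    """lookup_abs: read the abstract value bound to an identifier."""
    return sigma[ident]

def store_abs(sigma, ident, v):
    """store_abs: return a new abstract state with ident bound to v."""
    new_sigma = dict(sigma)
    new_sigma[ident] = v
    return new_sigma

def xor_assign(sigma, dst, a, b):
    """Abstract transformer for the statement `dst = a xor b;` (hypothetical helper)."""
    return store_abs(sigma, dst, XOR_ABS[lookup_abs(sigma, a)][lookup_abs(sigma, b)])

sigma0 = {"x": NEG, "y": POS}
sigma1 = xor_assign(sigma0, "x", "x", "y")   # s1: x = x xor y;
sigma2 = xor_assign(sigma1, "y", "x", "y")   # s2: y = x xor y;
sigma3 = xor_assign(sigma2, "x", "x", "y")   # s3: x = x xor y;
```

Under these assumptions, `sigma1`, `sigma2`, and `sigma3` coincide with the three abstract states listed in Fig. 4.2.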

4.2  A  Logic  and  Two  Programming  Languages 

This section defines a quantifier-free first-order bit-vector logic, L; a simple source-code language, PL, which has only int-valued variables and pointer variables; and a simple machine-code language, MC.

4.2.1  L:  A  Quantifier-Free  Bit-Vector  Logic  with  Finite  Functions 

The logic L is quantifier-free first-order bit-vector logic over a vocabulary of constant symbols ($I \in Id$) and function symbols ($F \in FuncId$). Strictly speaking, we work with various instantiations of L, denoted by L[PL] and L[MC], in which the vocabularies of function symbols are chosen to describe aspects of the values used by, and computations performed by, the programming languages PL and MC, respectively.

We distinguish the syntactic symbols of L from their counterparts in PL (§4.1 and §4.2.2) by using boxes around L's symbols.

$c \in C_{Int32} = \{0, 1, \ldots\}$
$op2_L \in BinOp_L = \{\boxed{+}, \boxed{-}, \boxed{\oplus}, \ldots\}$
$rop_L \in RelOp_L = \{\boxed{=}, \boxed{\neq}, \boxed{<}, \boxed{>}, \ldots\}$
$bop_L \in BoolOp_L = \{\boxed{\&\&}, \boxed{||}, \ldots\}$

The rest of the syntax of L[·] is defined as follows:

$I \in Id$, $T \in Term$, $\varphi \in Formula$,
$F \in FuncId$, $FE \in FuncExpr$, $U \in StructUpdate$

$T ::= c \mid I \mid T_1\ op2_L\ T_2 \mid ite(\varphi, T_1, T_2) \mid FE(T)$
$\varphi ::= \boxed{T} \mid \boxed{F} \mid T_1\ rop_L\ T_2 \mid \boxed{\neg}\varphi_1 \mid \varphi_1\ bop_L\ \varphi_2$
$FE ::= F \mid FE_1[T_1 \mapsto T_2]$
$U ::= (\{I_i \hookleftarrow T_i\}, \{F_j \hookleftarrow FE_j\})$

A Term of the form $ite(\varphi, T_1, T_2)$ represents an if-then-else expression. Names of the form $F \in FuncId$, possibly with subscripts and/or primes, are function symbols. A FuncExpr of the form $FE_1[T_1 \mapsto T_2]$ denotes a function-update expression. A StructUpdate of the form $(\{I_i \hookleftarrow T_i\}, \{F_j \hookleftarrow FE_j\})$ is called a structure-update expression. It specifies a structure-transformation operation that yields a structure in which the identifier $I_i$ is updated to the value of term $T_i$, and the function identifier $F_j$ is updated to the value of function-expression $FE_j$. The subscripts $i$ and $j$ implicitly range over certain index sets, which will be omitted to reduce clutter. To emphasize that $I_i$ and $F_j$ refer to next-state quantities, we sometimes write structure-update expressions with primes: $(\{I'_i \hookleftarrow T_i\}, \{F'_j \hookleftarrow FE_j\})$. $\{I'_i \hookleftarrow T_i\}$ specifies the updates to the interpretations of the constant symbols and $\{F'_j \hookleftarrow FE_j\}$ specifies the updates to the interpretations of the function symbols (see below). Thus, a structure-update expression $(\{I'_i \hookleftarrow T_i\}, \{F'_j \hookleftarrow FE_j\})$ can be thought of as a kind of restricted 2-vocabulary (i.e., 2-state) formula $\bigwedge_i (I'_i = T_i) \wedge \bigwedge_j (F'_j = FE_j)$. We define $U_{id}$ to be

$(\{I' \hookleftarrow I \mid I \in Id\}, \{F' \hookleftarrow F \mid F \in FuncId\})$.

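Semantics of L. The semantics of L[·] is defined in terms of a logical structure, which gives meaning to the Id and FuncId symbols of the logic's vocabulary:

$\iota \in LogicalStruct = (Id \to Val) \times (FuncId \to (Val \to Val))$.

$(\iota{\uparrow}1)$ assigns meanings to constant symbols, and $(\iota{\uparrow}2)$ assigns meanings to function symbols. ("$(p{\uparrow}1)$" and "$(p{\uparrow}2)$" denote the 1st and 2nd components, respectively, of a pair $p$.)

$const : C_{Int32} \to Val$
$cond_L : BVal \to Val \to Val \to Val$
$lookupId : LogicalStruct \to Id \to Val$
$binop_L : BinOp_L \to (Val \times Val \to Val)$
$relop_L : RelOp_L \to (Val \times Val \to BVal)$
$boolop_L : BoolOp_L \to (BVal \times BVal \to BVal)$
$lookupFuncId : LogicalStruct \to FuncId \to (Val \to Val)$
$access : ((Val \to Val) \times Val) \to Val$
$update : ((Val \to Val) \times Val \times Val) \to (Val \to Val)$

$\mathcal{T} : Term \to LogicalStruct \to Val$
$\mathcal{T}[\![c]\!]\iota = const(c)$
$\mathcal{T}[\![I]\!]\iota = lookupId\ \iota\ I$
$\mathcal{T}[\![T_1\ op2_L\ T_2]\!]\iota = \mathcal{T}[\![T_1]\!]\iota\ \ binop_L(op2_L)\ \ \mathcal{T}[\![T_2]\!]\iota$
$\mathcal{T}[\![ite(\varphi, T_1, T_2)]\!]\iota = cond_L(\mathcal{F}[\![\varphi]\!]\iota, \mathcal{T}[\![T_1]\!]\iota, \mathcal{T}[\![T_2]\!]\iota)$
$\mathcal{T}[\![FE(T_1)]\!]\iota = access(\mathcal{FE}[\![FE]\!]\iota, \mathcal{T}[\![T_1]\!]\iota)$

$\mathcal{F} : Formula \to LogicalStruct \to BVal$
$\mathcal{F}[\![\boxed{T}]\!]\iota = T$
$\mathcal{F}[\![\boxed{F}]\!]\iota = F$
$\mathcal{F}[\![T_1\ rop_L\ T_2]\!]\iota = \mathcal{T}[\![T_1]\!]\iota\ \ relop_L(rop_L)\ \ \mathcal{T}[\![T_2]\!]\iota$
$\mathcal{F}[\![\boxed{\neg}\varphi_1]\!]\iota = \neg\mathcal{F}[\![\varphi_1]\!]\iota$
$\mathcal{F}[\![\varphi_1\ bop_L\ \varphi_2]\!]\iota = \mathcal{F}[\![\varphi_1]\!]\iota\ \ boolop_L(bop_L)\ \ \mathcal{F}[\![\varphi_2]\!]\iota$

$\mathcal{FE} : FuncExpr \to LogicalStruct \to (Val \to Val)$
$\mathcal{FE}[\![F]\!]\iota = lookupFuncId\ \iota\ F$
$\mathcal{FE}[\![FE_1[T_1 \mapsto T_2]]\!]\iota = update(\mathcal{FE}[\![FE_1]\!]\iota, \mathcal{T}[\![T_1]\!]\iota, \mathcal{T}[\![T_2]\!]\iota)$

$\mathcal{U} : StructUpdate \to LogicalStruct \to LogicalStruct$
$\mathcal{U}[\![(\{I_i \hookleftarrow T_i\}, \{F_j \hookleftarrow FE_j\})]\!]\iota = ((\iota{\uparrow}1)[I_i \mapsto \mathcal{T}[\![T_i]\!]\iota],\ (\iota{\uparrow}2)[F_j \mapsto \mathcal{FE}[\![FE_j]\!]\iota])$

Figure 4.3 The factored semantics of L.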

The factored semantics of L is presented in Fig. 4.3. Motivated by the needs of later sections, we retain the convention from §4.1 of working with the domain Val rather than Int32. Similarly, we also use BVal rather than Bool. The standard interpretations of $binop_L$, $relop_L$, and $boolop_L$ are as one would expect, e.g., $v_1\ binop_L(\boxed{\oplus})\ v_2 = v_1\ xor\ v_2$, etc. The standard interpretations for $lookupId_{std}$ and $lookupFuncId_{std}$ select from the first and second components, respectively, of a LogicalStruct: $lookupId_{std}\ \iota\ I = (\iota{\uparrow}1)(I)$ and $lookupFuncId_{std}\ \iota\ F = (\iota{\uparrow}2)(F)$. The standard interpretations for access and update select from, and store to, a map, respectively.

Let $U = (\{I_i \hookleftarrow T_i\}, \{F_j \hookleftarrow FE_j\})$. Because $\mathcal{U}[\![U]\!]\iota$ retains from $\iota$ the value of each constant $I$ and function $F$ for which an update is not defined explicitly in $U$ (i.e., $I \in (Id - \{I_i\})$ and $F \in (FuncId - \{F_j\})$), as a notational convenience we sometimes treat $U$ as if it contains an identity update for each such symbol; that is, we say that $(U{\uparrow}1)I = I$ for $I \in (Id - \{I_i\})$, and $(U{\uparrow}2)F = F$ for $F \in (FuncId - \{F_j\})$.
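To make the standard interpretation of L concrete, the following Python sketch evaluates Terms, Formulas, and FuncExprs in a logical structure. The nested-tuple encoding of syntax, the function names, and the choice to model each function symbol as a dictionary that defaults to 0 are all assumptions made for illustration; only a few operators are covered.

```python
MASK = 0xFFFFFFFF  # assumption: Val is modeled as an unsigned 32-bit bit-vector

BINOPS = {
    "+":   lambda a, b: (a + b) & MASK,
    "-":   lambda a, b: (a - b) & MASK,
    "xor": lambda a, b: a ^ b,
}

def eval_funcexpr(fe, iota):
    """FE[[fe]]iota: a Python function Val -> Val."""
    _, funcs = iota
    if fe[0] == "func":                       # lookupFuncId iota F
        table = funcs[fe[1]]
        return lambda k: table.get(k, 0)
    if fe[0] == "upd":                        # FE1[T1 |-> T2]: update
        base = eval_funcexpr(fe[1], iota)
        key, val = eval_term(fe[2], iota), eval_term(fe[3], iota)
        return lambda k: val if k == key else base(k)
    raise ValueError(fe[0])

def eval_term(t, iota):
    """T[[t]]iota under the standard interpretation."""
    consts, _ = iota
    tag = t[0]
    if tag == "const":
        return t[1] & MASK
    if tag == "id":                           # lookupId iota I
        return consts[t[1]]
    if tag == "binop":
        return BINOPS[t[1]](eval_term(t[2], iota), eval_term(t[3], iota))
    if tag == "ite":                          # cond_L
        branch = t[2] if eval_formula(t[1], iota) else t[3]
        return eval_term(branch, iota)
    if tag == "app":                          # FE(T): access
        return eval_funcexpr(t[1], iota)(eval_term(t[2], iota))
    raise ValueError(tag)

def eval_formula(phi, iota):
    """F[[phi]]iota for a small subset of formulas."""
    tag = phi[0]
    if tag == "true":
        return True
    if tag == "false":
        return False
    if tag == "eq":
        return eval_term(phi[1], iota) == eval_term(phi[2], iota)
    if tag == "not":
        return not eval_formula(phi[1], iota)
    raise ValueError(tag)

# A logical structure: constants x, y denote addresses; F_rho models the store.
iota = ({"x": 100, "y": 104}, {"F_rho": {100: 3, 104: 5}})
F = ("func", "F_rho")
fx, fy = ("app", F, ("id", "x")), ("app", F, ("id", "y"))
updated = ("upd", F, ("id", "x"), ("binop", "xor", fx, fy))   # F_rho[x |-> F_rho(x) xor F_rho(y)]
```

Evaluating `("app", updated, ("id", "x"))` in `iota` reads back the stored xor, whereas `("app", updated, ("id", "y"))` falls through to the original function, which is the behavior that access and update are meant to have on maps.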

4.2.2  PL  :  A  Simple  Source-Level  Language 

PL is the language from §4.1, extended with some additional kinds of int-valued expressions, an address-generation expression, a dereferencing expression, and an indirect-assignment statement. Note that arithmetic operations can also occur inside a dereference expression; i.e., PL allows arithmetic to be performed on addresses (including bitwise operations on addresses: see Ex. 4.2).

$c \in C_{Int32}$, $I \in Id$, $E \in Expr$, $BE \in BoolExpr$, $S \in Stmt$
$c ::= 0 \mid 1 \mid \ldots$
$E ::= c \mid I \mid \&I \mid {*}E \mid E_1\ op2\ E_2 \mid BE\ ?\ E_1 : E_2$
$BE ::= T \mid F \mid E_1\ rop\ E_2 \mid \neg BE_1 \mid BE_1\ bop\ BE_2$
$S ::= I = E; \mid {*}I = E; \mid S_1\ S_2$

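Semantics of PL. The factored semantics of PL is presented in Fig. 4.4. The semantic domain Loc stands for locations (or memory addresses). We identify Loc with the set Val of values. A state $\sigma \in State$ is a pair $(\eta, \rho)$, where, in the standard semantics, environment $\eta \in Env = Id \to Loc$ maps identifiers to their associated locations and store $\rho \in Store = Loc \to Val$ maps each location to the value that it holds.

$v \in Val$
$l \in Loc = Val$
$\eta \in Env = Id \to Loc$
$\rho \in Store = Loc \to Val$
$\sigma \in State = Store \times Env$

$const : C_{Int32} \to Val$
$cond : BVal \to Val \to Val \to Val$
$lookupState : State \to Id \to Val$
$lookupEnv : State \to Id \to Loc$
$lookupStore : State \to Loc \to Val$
$updateStore : State \to Loc \to Val \to State$

$\mathcal{E} : Expr \to State \to Val$
$\mathcal{E}[\![c]\!]\sigma = const(c)$
$\mathcal{E}[\![I]\!]\sigma = lookupState\ \sigma\ I$
$\mathcal{E}[\![\&I]\!]\sigma = lookupEnv\ \sigma\ I$
$\mathcal{E}[\![{*}E]\!]\sigma = lookupStore\ \sigma\ (\mathcal{E}[\![E]\!]\sigma)$
$\mathcal{E}[\![E_1\ op2\ E_2]\!]\sigma = \mathcal{E}[\![E_1]\!]\sigma\ \ binop(op2)\ \ \mathcal{E}[\![E_2]\!]\sigma$
$\mathcal{E}[\![BE\ ?\ E_1 : E_2]\!]\sigma = cond(\mathcal{B}[\![BE]\!]\sigma, \mathcal{E}[\![E_1]\!]\sigma, \mathcal{E}[\![E_2]\!]\sigma)$

$\mathcal{B} : BoolExpr \to State \to BVal$
$\mathcal{B}[\![T]\!]\sigma = T$
$\mathcal{B}[\![F]\!]\sigma = F$
$\mathcal{B}[\![E_1\ rop\ E_2]\!]\sigma = \mathcal{E}[\![E_1]\!]\sigma\ \ relop(rop)\ \ \mathcal{E}[\![E_2]\!]\sigma$
$\mathcal{B}[\![\neg BE_1]\!]\sigma = \neg\mathcal{B}[\![BE_1]\!]\sigma$
$\mathcal{B}[\![BE_1\ bop\ BE_2]\!]\sigma = \mathcal{B}[\![BE_1]\!]\sigma\ \ boolop(bop)\ \ \mathcal{B}[\![BE_2]\!]\sigma$

$\mathcal{I} : Stmt \to State \to State$
$\mathcal{I}[\![I = E;]\!]\sigma = updateStore\ \sigma\ (lookupEnv\ \sigma\ I)\ (\mathcal{E}[\![E]\!]\sigma)$
$\mathcal{I}[\![{*}I = E;]\!]\sigma = updateStore\ \sigma\ (\mathcal{E}[\![I]\!]\sigma)\ (\mathcal{E}[\![E]\!]\sigma)$
$\mathcal{I}[\![S_1\ S_2]\!]\sigma = \mathcal{I}[\![S_2]\!](\mathcal{I}[\![S_1]\!]\sigma)$

Figure 4.4 The factored semantics of PL.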


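The standard interpretations of the operators used in the PL semantics are

$BVal_{std} = BVal$
$Val_{std} = Int32$
$Loc_{std} = Int32$
$\eta \in Env_{std} = Id \to Loc_{std}$
$\rho \in Store_{std} = Loc_{std} \to Val_{std}$

$cond_{std} = \lambda b.\lambda v_1.\lambda v_2.(b\ ?\ v_1 : v_2)$
$lookupState_{std} = \lambda(\eta, \rho).\lambda I.\rho(\eta(I))$
$lookupEnv_{std} = \lambda(\eta, \rho).\lambda I.\eta(I)$
$lookupStore_{std} = \lambda(\eta, \rho).\lambda l.\rho(l)$
$updateStore_{std} = \lambda(\eta, \rho).\lambda l.\lambda v.(\eta, \rho[l \mapsto v])$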
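As an informal check of these definitions, the standard interpretation can be transcribed into Python almost literally. In this sketch, states are pairs of dictionaries, the helpers `assign_xor` and `indirect_assign_xor` are hypothetical names for the meanings of `dst = a xor b;` and `*dst = *a xor *b;`, and the concrete addresses are made up for the example.

```python
def lookup_state(sigma, ident):
    eta, rho = sigma
    return rho[eta[ident]]

def lookup_env(sigma, ident):
    eta, _ = sigma
    return eta[ident]

def lookup_store(sigma, loc):
    _, rho = sigma
    return rho[loc]

def update_store(sigma, loc, val):
    eta, rho = sigma
    new_rho = dict(rho)
    new_rho[loc] = val
    return (eta, new_rho)

def assign_xor(sigma, dst, a, b):
    """I[[dst = a xor b;]]sigma (direct assignment)."""
    val = lookup_state(sigma, a) ^ lookup_state(sigma, b)
    return update_store(sigma, lookup_env(sigma, dst), val)

def indirect_assign_xor(sigma, dst, a, b):
    """I[[*dst = *a xor *b;]]sigma (indirect assignment through pointers)."""
    deref = lambda ident: lookup_store(sigma, lookup_state(sigma, ident))
    return update_store(sigma, lookup_state(sigma, dst), deref(a) ^ deref(b))

def run(sigma, step, stmts):
    """Sequencing: I[[S1 S2]]sigma = I[[S2]](I[[S1]]sigma)."""
    for dst, a, b in stmts:
        sigma = step(sigma, dst, a, b)
    return sigma

# Three xor assignments on int-valued variables swap x and y.
s = run(({"x": 100, "y": 104}, {100: 3, 104: 5}), assign_xor,
        [("x", "x", "y"), ("y", "x", "y"), ("x", "x", "y")])

# The pointer version, first without aliasing and then with px and py aliased.
ptr_stmts = [("px", "px", "py"), ("py", "px", "py"), ("px", "px", "py")]
no_alias = run(({"px": 200, "py": 204}, {200: 100, 204: 104, 100: 3, 104: 5}),
               indirect_assign_xor, ptr_stmts)
aliased = run(({"px": 200, "py": 204}, {200: 100, 204: 100, 100: 7}),
              indirect_assign_xor, ptr_stmts)
```

With distinct targets the pointer version swaps the two cells; when both pointers refer to the same cell (a hypothetical aliased state chosen here, not necessarily the one in Fig. 4.1(c)), the cell is zeroed instead.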

Handling Computations that "Go Wrong". In accounts of axiomatic semantics [146] and relational semantics [171], one generally considers four outcomes of an execution: an execution terminates (in some final state), goes wrong, blocks, or diverges. Because we are only providing the semantics of individual statements/instructions, to simplify matters, we consider only semantic specifications that are terminating. This eliminates outcomes that block or diverge.

We sidestep the need for an explicit outcome for "goes wrong" by introducing an additional BVal variable in the state, isRunning, which is set to false to model computations that "go wrong". In the extended semantics, a state $\sigma \in State$ is a triple $(\eta, \rho, isRunning)$. Fig. 4.5 shows a sketch of how to add the semantics of the outcome for "divide-by-zero". For the moment, we consider only deterministic specifications. §4.6 discusses how we handle non-determinism.

4.2.3  MC:  A  Simple  Machine-Code  Language 

MC  is  based  on  the  x86  instruction  set,  but  greatly  simplified  to  have  just  four  registers,  one 
flag,  and  four  instructions. 

$r \in register$, $do \in dst\_operand$,
$so \in src\_operand$, $i \in instruction$
$r ::= eax \mid ebx \mid ebp \mid eip$
$flagName ::= zf$
$do ::= Indirect(r, Val) \mid DirectReg(r)$
$so ::= do \cup Immediate(Val)$
$instruction ::= MOV(do, so) \mid CMP(do, so) \mid XOR(do, so) \mid JZ(do)$

$Loc = Val$
$Env = Id \to Loc$
$Store = Loc \to Val$
$State = Store \times Env \times BVal$

$const : C_{Int32} \to Val$
$lookupState : State \to Id \to Val$
$getIsRunning : (Val, BVal) \to BVal$
$lookupIsRunning : State \to BVal$
$updateIsRunning : State \to BVal \to State$
$getIsRunning = \lambda(v, b).b$
$lookupIsRunning = \lambda(\eta, \rho, b).b$
$updateIsRunning = \lambda(\eta, \rho, b).\lambda b'.(\eta, \rho, b')$

$\mathcal{E} : Expr \to State \to (Val, BVal)$
$\mathcal{E}[\![c]\!]\sigma = (const(c), T)$
$\mathcal{E}[\![I]\!]\sigma = (lookupState\ \sigma\ I, T)$
$\mathcal{E}[\![E_1 / E_2]\!]\sigma = (\mathcal{E}[\![E_2]\!]\sigma = 0)\ ?\ (1, F) : (\mathcal{E}[\![E_1]\!]\sigma / \mathcal{E}[\![E_2]\!]\sigma, T)$

$\mathcal{I} : Stmt \to State \to State$
$\mathcal{I}[\![I = E;]\!]\sigma = (lookupIsRunning\ \sigma) = T$
$\qquad ?\ (getIsRunning\ \mathcal{E}[\![E]\!]\sigma) = T$
$\qquad\qquad ?\ updateStore\ \sigma\ (lookupEnv\ \sigma\ I)\ (\mathcal{E}[\![E]\!]\sigma)$
$\qquad\qquad :\ updateIsRunning\ \sigma\ F$
$\qquad :\ \sigma$

Figure 4.5 An extended semantics of PL to accommodate the outcome of "divide-by-zero" execution.


Semantics of MC. The factored semantics of MC is presented in Fig. 4.6. It is similar to the semantics of PL, although MC exhibits two features not part of PL: there is an explicit program counter (eip), and MC includes the typical feature of machine-code languages that a branch is split across two instructions (CMP ... JZ). An MC state $\sigma \in State$ is a triple $(mem, reg, flag)$, where mem is a map $Val \to Val$, reg is a map $register \to Val$, and flag is a map $flagName \to BVal$. We assume that each instruction is 4 bytes long; hence, the execution of a MOV, CMP, or XOR increments the program-counter register eip by 4. CMP sets the value of zf according to the difference of the values of the two operands; JZ updates eip depending on the value of flag zf.


4.3  Symbolic  Analysis  for  PL  via  Reinterpretation 

A PL state $(\eta, \rho)$ can be modeled in L[PL] by using a function symbol $F_\rho$ for store $\rho$, and a constant symbol $c_x \in Id$ for each PL identifier $x$. (To reduce clutter, we will use $x$ for such constants instead of $c_x$.) Given $\iota \in LogicalStruct$, the constant symbols and their interpretations in $\iota$ correspond to environment $\eta$, and the interpretation of $F_\rho$ in $\iota$ corresponds to store $\rho$.

$const : C_{Int32} \to Val$
$cond : BVal \to Val \to Val \to Val$
$lookup_{reg} : State \to register \to Val$
$lookup_{mem} : State \to Val \to Val$
$lookup_{flag} : State \to flagName \to BVal$
$store_{reg} : State \to register \to Val \to State$
$store_{mem} : State \to Val \to Val \to State$
$store_{flag} : State \to flagName \to BVal \to State$
$incr_{eip} : State \to State$
$incr_{eip} = \lambda\sigma.store_{reg}(\sigma, eip, \mathcal{R}[\![eip]\!]\sigma\ binop(+)\ 4)$

$\mathcal{R} : reg \to State \to Val$
$\mathcal{R}[\![r]\!]\sigma = lookup_{reg}(\sigma, r)$

$\mathcal{K} : flagName \to State \to BVal$
$\mathcal{K}[\![zf]\!]\sigma = lookup_{flag}(\sigma, zf)$

$\mathcal{O} : src\_operand \to State \to Val$
$\mathcal{O}[\![Indirect(r, c)]\!]\sigma = lookup_{mem}(\sigma, \mathcal{R}[\![r]\!]\sigma\ binop(+)\ const(c))$
$\mathcal{O}[\![DirectReg(r)]\!]\sigma = \mathcal{R}[\![r]\!]\sigma$
$\mathcal{O}[\![Immediate(c)]\!]\sigma = const(c)$

$\mathcal{I} : instruction \to State \to State$
$\mathcal{I}[\![MOV(Indirect(r, c), so)]\!]\sigma = incr_{eip}(store_{mem}(\sigma, \mathcal{R}[\![r]\!]\sigma\ binop(+)\ const(c), \mathcal{O}[\![so]\!]\sigma))$
$\mathcal{I}[\![MOV(DirectReg(r), so)]\!]\sigma = incr_{eip}(store_{reg}(\sigma, r, \mathcal{O}[\![so]\!]\sigma))$
$\mathcal{I}[\![CMP(do, so)]\!]\sigma = incr_{eip}(store_{flag}(\sigma, zf, \mathcal{O}[\![do]\!]\sigma\ binop(-)\ \mathcal{O}[\![so]\!]\sigma\ relop(=)\ 0))$
$\mathcal{I}[\![XOR(do{:}Indirect(r, c), so)]\!]\sigma = incr_{eip}(store_{mem}(\sigma, \mathcal{R}[\![r]\!]\sigma\ binop(+)\ const(c), \mathcal{O}[\![do]\!]\sigma\ binop(\oplus)\ \mathcal{O}[\![so]\!]\sigma))$
$\mathcal{I}[\![XOR(do{:}DirectReg(r), so)]\!]\sigma = incr_{eip}(store_{reg}(\sigma, r, \mathcal{O}[\![do]\!]\sigma\ binop(\oplus)\ \mathcal{O}[\![so]\!]\sigma))$
$\mathcal{I}[\![JZ(do)]\!]\sigma = store_{reg}(\sigma, eip, cond(\mathcal{K}[\![zf]\!]\sigma, \mathcal{O}[\![do]\!]\sigma, \mathcal{R}[\![eip]\!]\sigma\ binop(+)\ 4))$

Figure 4.6 The factored semantics of MC.
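The standard semantics of MC can be prototyped as a small interpreter. The sketch below is an illustration under several assumptions: states are triples of Python dictionaries, operands and instructions are tuples tagged with their constructor names, values are 32-bit unsigned, and unwritten memory reads as 0 (the source does not say what uninitialized memory holds).

```python
MASK = 0xFFFFFFFF  # assumption: Val is a 32-bit unsigned bit-vector

def operand(o, s):
    """O[[o]]s: the value of a source operand."""
    mem, reg, _ = s
    if o[0] == "Indirect":
        return mem.get((reg[o[1]] + o[2]) & MASK, 0)   # assumed default: 0
    if o[0] == "DirectReg":
        return reg[o[1]]
    if o[0] == "Immediate":
        return o[1] & MASK
    raise ValueError(o[0])

def store_dst(do, val, s):
    """store_mem or store_reg, chosen by the form of the destination operand."""
    mem, reg, flag = s
    if do[0] == "Indirect":
        return ({**mem, (reg[do[1]] + do[2]) & MASK: val}, reg, flag)
    return (mem, {**reg, do[1]: val}, flag)

def incr_eip(s):
    mem, reg, flag = s
    return (mem, {**reg, "eip": (reg["eip"] + 4) & MASK}, flag)

def step(instr, s):
    """I[[instr]]s for the four MC instructions."""
    op = instr[0]
    if op == "MOV":
        return incr_eip(store_dst(instr[1], operand(instr[2], s), s))
    if op == "XOR":
        return incr_eip(store_dst(instr[1], operand(instr[1], s) ^ operand(instr[2], s), s))
    if op == "CMP":
        mem, reg, flag = s
        zf = ((operand(instr[1], s) - operand(instr[2], s)) & MASK) == 0
        return incr_eip((mem, reg, {**flag, "zf": zf}))
    if op == "JZ":
        mem, reg, flag = s
        target = operand(instr[1], s) if flag["zf"] else (reg["eip"] + 4) & MASK
        return (mem, {**reg, "eip": target}, flag)
    raise ValueError(op)

EAX, EBX, EBP = ("DirectReg", "eax"), ("DirectReg", "ebx"), ("DirectReg", "ebp")
s = ({}, {"eax": 3, "ebx": 5, "ebp": 64, "eip": 0}, {"zf": False})
for ins in [("XOR", EAX, EBX), ("XOR", EBX, EAX), ("XOR", EAX, EBX),   # xor swap
            ("MOV", ("Indirect", "ebp", 8), EAX),
            ("CMP", EAX, ("Immediate", 5))]:
    s = step(ins, s)
taken = step(("JZ", EBP), s)
not_taken = step(("JZ", EBP), step(("CMP", EAX, ("Immediate", 0)), s))
```

After the three XOR instructions the contents of eax and ebx are exchanged; the CMP/JZ pair then illustrates how a branch is split across two instructions, with JZ either loading eip from its operand or advancing it by 4.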

Symbolic Evaluation. A primitive for forward symbolic evaluation must solve the following problem: Given the semantic definition of a programming language, together with a specific statement s, create a logical formula that captures the semantics of s. The following table illustrates how the semantics of PL statements can be expressed as L[PL] structure-update expressions:

  PL         | L[PL]
  -----------+-------------------------------------------------------------------------
  x = 17;    | $(\emptyset, \{F'_\rho \hookleftarrow F_\rho[x \mapsto 17]\})$
  x = y;     | $(\emptyset, \{F'_\rho \hookleftarrow F_\rho[x \mapsto F_\rho(y)]\})$
  x = *q;    | $(\emptyset, \{F'_\rho \hookleftarrow F_\rho[x \mapsto F_\rho(F_\rho(q))]\})$

To create such expressions automatically using semantic reinterpretation, we use formulas of the logic L[PL] as a reinterpretation domain for the meta-language primitives used to define PL. The base types and the state type of the meta-language are reinterpreted as follows (our convention is to mark each reinterpreted base type, function type, and operator with an overbar): $\overline{Val} = Term$, $\overline{BVal} = Formula$, and $\overline{State} = StructUpdate$. The operators used in PL's meaning functions $\mathcal{E}$, $\mathcal{B}$, and $\mathcal{I}$ are reinterpreted over these domains as follows:

•  The arithmetic, bitwise, relational, and logical operators are interpreted as syntactic constructors of L[PL] Terms and Formulas, e.g.,

   $\overline{binop}(\oplus) = \lambda T_1.\lambda T_2.\ T_1\ \boxed{\oplus}\ T_2$.

   Straightforward simplifications are also performed; e.g., $0\ \boxed{\oplus}\ a$ simplifies to $a$, etc. Other simplifications that we perform are similar to ones used by others, such as the preprocessing steps used in decision procedures (e.g., the ite-lifting and read-over-write transformations for operations on functions [89]).

•  $\overline{cond}$ residuates an $ite(\cdot, \cdot, \cdot)$ Term when the result cannot be simplified to a single branch.

The other operations used in the PL semantics are reinterpreted as follows:

$\overline{lookupState} : StructUpdate \to Id \to Term$
$\overline{lookupState} = \lambda U.\lambda I.((U{\uparrow}2)F_\rho)((U{\uparrow}1)I)$

$\overline{lookupEnv} : StructUpdate \to Id \to Term$
$\overline{lookupEnv} = \lambda U.\lambda I.(U{\uparrow}1)I$

$\overline{lookupStore} : StructUpdate \to Term \to Term$
$\overline{lookupStore} = \lambda U.\lambda T.((U{\uparrow}2)F_\rho)(T)$

$\overline{updateStore} : StructUpdate \to Term \to Term \to StructUpdate$
$\overline{updateStore} = \lambda U.\lambda T_1.\lambda T_2.((U{\uparrow}1),\ (U{\uparrow}2)[F_\rho \mapsto ((U{\uparrow}2)F_\rho)[T_1 \mapsto T_2]])$


By extension, this produces functions $\overline{\mathcal{E}}$, $\overline{\mathcal{B}}$, and $\overline{\mathcal{I}}$ with the types shown in Fig. 4.7.

In particular, given a StructUpdate $U$, function $\overline{\mathcal{I}}$ translates a statement $s$ of PL to the StructUpdate $\overline{\mathcal{I}}[\![s]\!]U$ in logic L[PL]. To perform symbolic evaluation along a path $\pi$, one starts with the StructUpdate $U_{id} = (\emptyset, \{F'_\rho \hookleftarrow F_\rho\})$ and repeatedly calls function $\overline{\mathcal{I}}$ with the next statement in $\pi$ and the current StructUpdate.

  Standard                                          | Reinterpreted
  --------------------------------------------------+------------------------------------------------------------------------
  $\mathcal{E}: Expr \to State \to Val$             | $\overline{\mathcal{E}}: Expr \to StructUpdate \to Term$
  $\mathcal{B}: BoolExpr \to State \to BVal$        | $\overline{\mathcal{B}}: BoolExpr \to StructUpdate \to Formula$
  $\mathcal{I}: Stmt \to State \to State$           | $\overline{\mathcal{I}}: Stmt \to StructUpdate \to StructUpdate$

Figure 4.7 Standard types of the PL meaning functions, and the reinterpreted types used to obtain an implementation of symbolic evaluation.


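$\overline{\mathcal{I}}[\![x = x \oplus y;]\!]U_{id} = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[x \mapsto (\overline{\mathcal{E}}[\![x]\!]U_{id}\ \boxed{\oplus}\ \overline{\mathcal{E}}[\![y]\!]U_{id})]\})$
$\quad = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[x \mapsto (F_\rho(x)\ \boxed{\oplus}\ F_\rho(y))]\}) = U_1$

$\overline{\mathcal{I}}[\![y = x \oplus y;]\!]U_1 = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[x \mapsto (F_\rho(x)\ \boxed{\oplus}\ F_\rho(y))][y \mapsto (\overline{\mathcal{E}}[\![x]\!]U_1\ \boxed{\oplus}\ \overline{\mathcal{E}}[\![y]\!]U_1)]\})$
$\quad = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[x \mapsto (F_\rho(x)\ \boxed{\oplus}\ F_\rho(y))][y \mapsto ((F_\rho(x)\ \boxed{\oplus}\ F_\rho(y))\ \boxed{\oplus}\ F_\rho(y))]\})$
$\quad = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[x \mapsto (F_\rho(x)\ \boxed{\oplus}\ F_\rho(y))][y \mapsto F_\rho(x)]\}) = U_2$

$\overline{\mathcal{I}}[\![x = x \oplus y;]\!]U_2 = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[x \mapsto (\overline{\mathcal{E}}[\![x]\!]U_2\ \boxed{\oplus}\ \overline{\mathcal{E}}[\![y]\!]U_2)][y \mapsto F_\rho(x)]\})$
$\quad = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[x \mapsto ((F_\rho(x)\ \boxed{\oplus}\ F_\rho(y))\ \boxed{\oplus}\ F_\rho(x))][y \mapsto F_\rho(x)]\})$
$\quad = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[x \mapsto F_\rho(y)][y \mapsto F_\rho(x)]\}) = U_{swap}$

Figure 4.8 Symbolic evaluation of Fig. 4.1(a) via semantic reinterpretation, starting with the StructUpdate $U_{id} = (\emptyset, \{F'_\rho \hookleftarrow F_\rho\})$.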
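The derivation in Fig. 4.8 can be mimicked with a very small symbolic evaluator. The following Python sketch rests on simplifying assumptions that are not part of the original system: it handles only statements of the form `dst = a xor b;` over distinct int-valued variables (no pointers), it represents the function-update part of a StructUpdate as a dictionary from identifiers to Terms, and it represents an xor Term as the set of atoms being xor-ed so that the cancellation laws fall out of set symmetric difference.

```python
def as_atoms(t):
    """View a Term as the set of atoms whose xor it denotes (0 is the empty set)."""
    if t == 0:
        return frozenset()
    return t if isinstance(t, frozenset) else frozenset([t])

def mk_xor(t1, t2):
    """Reinterpreted binop(xor): build a Term, simplifying a xor a = 0 and 0 xor a = a."""
    atoms = as_atoms(t1) ^ as_atoms(t2)        # symmetric difference cancels pairs
    if not atoms:
        return 0
    if len(atoms) == 1:
        return next(iter(atoms))
    return atoms

def lookup_state_sym(updates, ident):
    """Reinterpreted lookupState: the Term currently denoting F_rho(ident)."""
    return updates.get(ident, f"F_rho({ident})")

def assign_xor_sym(updates, dst, a, b):
    """Reinterpreted meaning of `dst = a xor b;` as a StructUpdate transformer."""
    new_updates = dict(updates)
    new_updates[dst] = mk_xor(lookup_state_sym(updates, a), lookup_state_sym(updates, b))
    return new_updates

u_id = {}                                       # no updates: F'_rho = F_rho
u1 = assign_xor_sym(u_id, "x", "x", "y")
u2 = assign_xor_sym(u1, "y", "x", "y")
u_swap = assign_xor_sym(u2, "x", "x", "y")
```

Here `u_swap` maps `x` to the Term for $F_\rho(y)$ and `y` to the Term for $F_\rho(x)$, matching the last line of Fig. 4.8; without the simplifications in `mk_xor` the Terms would grow with every statement.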


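Example 4.2 The steps of symbolic evaluation of Fig. 4.1(a) via semantic reinterpretation, starting with $U_{id}$, are shown in Fig. 4.8. The resulting StructUpdate, $U_{swap}$, can be considered to be the 2-vocabulary formula

$F'_\rho = F_\rho[x \mapsto F_\rho(y)][y \mapsto F_\rho(x)]$,

which expresses a state change in which the values of program variables x and y are swapped. Algebraic simplification plays an important role. For example, when y is updated in $U_1$ by

$[y \mapsto ((F_\rho(x)\ \boxed{\oplus}\ F_\rho(y))\ \boxed{\oplus}\ F_\rho(y))]$

(see Fig. 4.8), the update is simplified to $[y \mapsto F_\rho(x)]$. □

$U_1 = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[0 \mapsto v][px \mapsto py][py \mapsto py]\})$

$\overline{\mathcal{I}}[\![{*}px = {*}px \oplus {*}py;]\!]U_1 = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[0 \mapsto v][px \mapsto py][py \mapsto (\overline{\mathcal{E}}[\![{*}px]\!]U_1\ \boxed{\oplus}\ \overline{\mathcal{E}}[\![{*}py]\!]U_1)]\})$
$\quad = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[0 \mapsto v][px \mapsto py][py \mapsto (py\ \boxed{\oplus}\ py)]\})$
$\quad = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[0 \mapsto v][px \mapsto py][py \mapsto 0]\}) = U_2$

$\overline{\mathcal{I}}[\![{*}py = {*}px \oplus {*}py;]\!]U_2 = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[0 \mapsto (\overline{\mathcal{E}}[\![{*}px]\!]U_2\ \boxed{\oplus}\ \overline{\mathcal{E}}[\![{*}py]\!]U_2)][px \mapsto py][py \mapsto 0]\})$
$\quad = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[0 \mapsto (0\ \boxed{\oplus}\ v)][px \mapsto py][py \mapsto 0]\})$
$\quad = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[0 \mapsto v][px \mapsto py][py \mapsto 0]\}) = U_3$

$\overline{\mathcal{I}}[\![{*}px = {*}px \oplus {*}py;]\!]U_3 = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[0 \mapsto v][px \mapsto py][py \mapsto (\overline{\mathcal{E}}[\![{*}px]\!]U_3\ \boxed{\oplus}\ \overline{\mathcal{E}}[\![{*}py]\!]U_3)]\})$
$\quad = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[0 \mapsto v][px \mapsto py][py \mapsto (0\ \boxed{\oplus}\ v)]\})$
$\quad = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[0 \mapsto v][px \mapsto py][py \mapsto v]\}) = U_4$

Figure 4.9 Symbolic evaluation of Fig. 4.1(b) via semantic reinterpretation, starting with a StructUpdate that corresponds to the "Before" column of Fig. 4.1(c).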

Example 4.3 To illustrate symbolic evaluation for an example that involves pointers and pointer-dereferencing operations, Fig. 4.9 shows the steps of symbolic evaluation of Fig. 4.1(b) via semantic reinterpretation, starting with a StructUpdate that corresponds to the "Before" column of Fig. 4.1(c). The program from Fig. 4.1(b) works correctly when there is no aliasing; however, it does not always work correctly when started from the kind of state shown in the "Before" column of Fig. 4.1(c). The StructUpdate $U_4$ obtained via our symbolic-evaluation primitive can be considered to be the 2-vocabulary formula

$F'_\rho = F_\rho[0 \mapsto v][px \mapsto py][py \mapsto v]$,

which expresses a state change that does not usually perform a successful swap. The example shows that the symbolic-evaluation method can faithfully track non-trivial situations that involve pointer aliasing. □

The correctness of our method for performing symbolic evaluation is captured by the following theorem:

Theorem 4.4 For all $s \in Stmt$, $U \in StructUpdate$, and $\iota \in LogicalStruct$, the meaning of $\overline{\mathcal{I}}[\![s]\!]U$ in $\iota$ (i.e., $\mathcal{U}[\![\overline{\mathcal{I}}[\![s]\!]U]\!]\iota$) is equivalent to running $\mathcal{I}$ on $s$ with an input state obtained from $\mathcal{U}[\![U]\!]\iota$. That is,

$\mathcal{U}[\![\overline{\mathcal{I}}[\![s]\!]U]\!]\iota = \mathcal{I}[\![s]\!](\mathcal{U}[\![U]\!]\iota)$.

Proof: See App. B.1. □

WLP. $\mathcal{WLP}(s, \varphi)$ characterizes the set of states $\sigma$ such that the execution of $s$ starting in $\sigma$ either fails to terminate or results in a state $\sigma'$ such that $\varphi(\sigma')$ holds. For a language that only has int-valued variables, the $\mathcal{WLP}$ of a postcondition (specified by formula $\varphi$) with respect to an assignment statement var = rhs; can be expressed as the formula obtained by substituting rhs for all (free) occurrences of var in $\varphi$, i.e., $\varphi[var \leftarrow rhs]$.

For a language with pointer variables, such as PL, syntactic substitution is not adequate for finding $\mathcal{WLP}$ formulas. For instance, suppose that we are interested in finding a formula for the $\mathcal{WLP}$ of postcondition $x = 5$ with respect to $*p = e;$. It is not correct merely to perform the substitution $(x = 5)[{*}p \leftarrow e]$. That substitution yields $x = 5$, whereas the $\mathcal{WLP}$ depends on the execution context in which $*p = e;$ is evaluated:

•  If p points to x, then the $\mathcal{WLP}$ formula should be $e = 5$.

•  If p does not point to x, then the $\mathcal{WLP}$ formula should be $x = 5$.

The desired formula can be expressed informally as

$((p = \&x)\ ?\ e : x) = 5$.
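The difference between the two candidate formulas can be checked by brute force on a tiny concrete model. In the Python sketch below, the three-cell memory layout, the addresses `X`, `P`, `E`, and the small value domain are hypothetical choices made only for this experiment; since `*p = e;` always terminates, a formula is a correct WLP exactly when it holds in a pre-state if and only if the postcondition holds in the post-state.

```python
# Hypothetical layout: the store is a tuple indexed by address.
X, P, E = 0, 1, 2                     # assumed addresses of variables x, p, e
VALUES = (0, 1, 2, 5)                 # small value domain for x and e

def run(store):
    """Concrete execution of `*p = e;`."""
    new_store = list(store)
    new_store[store[P]] = store[E]    # write the value of e to the cell p points to
    return tuple(new_store)

def post(store):
    """Postcondition x = 5."""
    return store[X] == 5

def naive_wlp(store):
    """(x = 5)[*p <- e], which is syntactically just x = 5."""
    return store[X] == 5

def morris_wlp(store):
    """((p = &x) ? e : x) = 5."""
    return (store[E] if store[P] == X else store[X]) == 5

# Enumerate all stores in which p holds a valid address.
stores = [(x, p, e) for x in VALUES for p in (X, P, E) for e in VALUES]
naive_ok = all(naive_wlp(s) == post(run(s)) for s in stores)
morris_ok = all(morris_wlp(s) == post(run(s)) for s in stores)
counterexample = next(s for s in stores if naive_wlp(s) != post(run(s)))
```

On this model the naive substitution fails (for instance, whenever p points to x and e holds 5 while x does not), whereas the conditional formula agrees with the concrete semantics on every store.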

For a program fragment that involves multiple pointer variables, the $\mathcal{WLP}$ formula may have to take into account all possible aliasing combinations. This is the essence of Morris's rule of substitution [138]. One of the most important features of our approach is its ability to create correct implementations of Morris's rule of substitution automatically, and basically for free.

Example 4.5 In L[PL], such a formula would be expressed as shown in the lower row below.

  Informal | $\mathcal{WLP}({*}p = e,\ x = 5) = ((p = \&x)\ ?\ e : x) = 5$
  L[PL]    | $\mathcal{WLP}({*}p = e,\ F_\rho(x)\ \boxed{=}\ 5) = ite(F_\rho(p)\ \boxed{=}\ x,\ F_\rho(e),\ F_\rho(x))\ \boxed{=}\ 5$

In Ex. 4.7, we will show how the latter formula is created via semantic reinterpretation. □

To create primitives for $\mathcal{WLP}$ and symbolic composition via semantic reinterpretation, we again use L[PL] as a reinterpretation domain; however, there is a trick: in contrast with what is done to generate symbolic-evaluation primitives, we use the StructUpdate type of L[PL] to reinterpret the meaning functions $\mathcal{U}$, $\mathcal{FE}$, $\mathcal{F}$, and $\mathcal{T}$ of L[PL] itself! By this means, the "alternative meaning" of a Term/Formula/FuncExpr/StructUpdate is a (usually different) Term/Formula/FuncExpr/StructUpdate in which some substitution and/or simplification has taken place. The general scheme is outlined in the following table:

  Meaning functions                                            | Type reinterpreted | Replacement type | Function created
  -------------------------------------------------------------+--------------------+------------------+----------------------
  $\mathcal{I}, \mathcal{E}, \mathcal{B}$                      | State              | StructUpdate     | Symbolic evaluation
  $\mathcal{F}, \mathcal{T}$                                   | LogicalStruct      | StructUpdate     | $\mathcal{WLP}$
  $\mathcal{U}, \mathcal{FE}, \mathcal{F}, \mathcal{T}$        | LogicalStruct      | StructUpdate     | Symbolic composition

In §4.2.1, we defined the semantics of L[·] in a form that would make it amenable to semantic reinterpretation. However, one small point needs adjustment: in §4.2.1, the type signatures of LogicalStruct, lookupFuncId, access, update, and $\mathcal{FE}$ include occurrences of $Val \to Val$. This was done to make the types more intuitive; however, for reinterpretation to work, an additional level of factoring is necessary. In particular, the occurrences of $Val \to Val$ need to be replaced by FVal. The standard semantics of FVal is $Val \to Val$; however, for creating symbolic-analysis primitives, FVal is reinterpreted as FuncExpr.

The reinterpretation used for $\mathcal{U}$, $\mathcal{FE}$, $\mathcal{F}$, and $\mathcal{T}$ is similar to what was used for symbolic evaluation of PL programs:

•  $\overline{Val} = Term$, $\overline{BVal} = Formula$, $\overline{FVal} = FuncExpr$, and $\overline{LogicalStruct} = StructUpdate$.

•  The arithmetic, bitwise, relational, and logical operators are interpreted as syntactic Term and Formula constructors of L, e.g.,

   $\overline{binop_L}(\boxed{\oplus}) = \lambda T_1.\lambda T_2.\ T_1\ \boxed{\oplus}\ T_2$,

   although straightforward simplifications are also performed.

•  $\overline{cond_L}$ residuates an $ite(\cdot, \cdot, \cdot)$ Term when the result cannot be simplified to a single branch.

•  $\overline{lookupId}$ and $\overline{lookupFuncId}$ are resolved immediately, rather than residuated:

   -  $\overline{lookupId}\ (\{I_i \hookleftarrow T_i\}, \{F_j \hookleftarrow FE_j\})\ I_k = T_k$

   -  $\overline{lookupFuncId}\ (\{I_i \hookleftarrow T_i\}, \{F_j \hookleftarrow FE_j\})\ F_k = FE_k$.

•  $\overline{access}$ and $\overline{update}$ are discussed below.

By extension, this produces reinterpreted meaning functions $\overline{\mathcal{U}}$, $\overline{\mathcal{FE}}$, $\overline{\mathcal{F}}$, and $\overline{\mathcal{T}}$.

Somewhat surprisingly, we do not need to introduce an explicit operation of substitution for our logic because a substitution operation is produced as a by-product of reinterpretation. In particular, in the standard semantics for L, the return types of meaning function $\mathcal{T}$ and helper function lookupId are both Val. However, in the reinterpreted semantics, a $\overline{Val}$ is a Term (i.e., something symbolic), which is used in subsequent computations. Thus, when $\iota \in LogicalStruct$ is reinterpreted as $U \in StructUpdate$, the reinterpretation of formula $\varphi$ via $\overline{\mathcal{F}}[\![\varphi]\!]U$ substitutes Terms found in $U$ into $\varphi$: $\overline{\mathcal{F}}[\![\varphi]\!]U$ calls $\overline{\mathcal{T}}[\![T]\!]U$, which may call $\overline{lookupId}\ U\ I$; the latter would return a Term fetched from $U$, which would be a subterm of the answer returned by $\overline{\mathcal{T}}[\![T]\!]U$, which in turn would be a subterm of the answer returned by $\overline{\mathcal{F}}[\![\varphi]\!]U$.

To create a formula for $\mathcal{WLP}$ via semantic reinterpretation, we make use of both $\overline{\mathcal{F}}$, the reinterpreted logic semantics, and $\overline{\mathcal{I}}$, the reinterpreted programming-language semantics. The $\mathcal{WLP}$ formula for $\varphi$ with respect to statement $s$ is obtained by performing the following computation:

$\mathcal{WLP}(s, \varphi) = \overline{\mathcal{F}}[\![\varphi]\!](\overline{\mathcal{I}}[\![s]\!]U_{id})$.    (4.1)




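Example 4.6 In Ex. 4.2 and Fig. 4.8, we derived the following StructUpdate, which expresses in L[PL] the semantics of the swap-code fragment swap from Fig. 4.1(a):

$U_{swap} = \overline{\mathcal{I}}[\![swap]\!]U_{id}$
$\quad = (\emptyset, \{F'_\rho \hookleftarrow F_\rho[x \mapsto F_\rho(y)][y \mapsto F_\rho(x)]\})$.

Using the method given in Eqn. (4.1), we obtain the following Formula of L[PL] for $\mathcal{WLP}(swap, F_\rho(x)\ \boxed{=}\ 2)$:

$\mathcal{WLP}(swap, F_\rho(x)\ \boxed{=}\ 2)$
$\quad = \overline{\mathcal{F}}[\![F_\rho(x)\ \boxed{=}\ 2]\!]U_{swap}$
$\quad = (\overline{\mathcal{T}}[\![F_\rho(x)]\!]U_{swap})\ \boxed{=}\ (\overline{\mathcal{T}}[\![2]\!]U_{swap})$
$\quad = (\overline{access}(\overline{\mathcal{FE}}[\![F_\rho]\!]U_{swap},\ \overline{\mathcal{T}}[\![x]\!]U_{swap}))\ \boxed{=}\ (\overline{const}(2))$
$\quad = (\overline{access}(\overline{lookupFuncId}\ U_{swap}\ F_\rho,\ \overline{lookupId}\ U_{swap}\ x))\ \boxed{=}\ 2$
$\quad = (\overline{access}(F_\rho[x \mapsto F_\rho(y)][y \mapsto F_\rho(x)],\ x))\ \boxed{=}\ 2$
$\quad = F_\rho(y)\ \boxed{=}\ 2$

(To understand the last step, see the discussion of $\overline{access}$ below.) □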

To understand how pointers are handled during the $\mathcal{WLP}$ operation, the key reinterpretations to concentrate on in L[PL] are the ones for the operations of the meta-language that manipulate FVals (i.e., arguments of type $Val \to Val$), in particular, access and update. We want $\overline{access}$ and $\overline{update}$ to enjoy the following semantic properties:

$\mathcal{T}[\![\overline{access}(FE_0, T_0)]\!]\iota = (\mathcal{FE}[\![FE_0]\!]\iota)(\mathcal{T}[\![T_0]\!]\iota)$

$\mathcal{FE}[\![\overline{update}(FE_0, T_0, T_1)]\!]\iota = (\mathcal{FE}[\![FE_0]\!]\iota)[\mathcal{T}[\![T_0]\!]\iota \mapsto \mathcal{T}[\![T_1]\!]\iota]$

Note that these properties require evaluating the results of $\overline{access}$ and $\overline{update}$ with respect to an arbitrary $\iota \in LogicalStruct$. As mentioned earlier, it is desirable for reinterpreted base-type operations to perform simplifications whenever possible, when they construct Terms, Formulas, FuncExprs, and StructUpdates. However, because the value of $\iota$ is unknown, $\overline{access}$ and $\overline{update}$ operate in an uncertain environment.

$\overline{access}(F, k_1) = F(k_1)$

$\overline{access}(FE[k_2 \mapsto d_2], k_1) =$
$\qquad d_2$    if $(k_1 \equiv k_2)$
$\qquad \overline{access}(FE, k_1)$    if $(k_1 \neq k_2)$
$\qquad ite(k_1\ \boxed{=}\ k_2,\ d_2,\ \overline{access}(FE, k_1))$    if $(k_1 \stackrel{?}{=} k_2)$

$\overline{update}(F, k_1, d_1) = F[k_1 \mapsto d_1]$

$\overline{update}(FE[k_2 \mapsto d_2], k_1, d_1) =$
$\qquad FE[k_1 \mapsto d_1]$    if $(k_1 \equiv k_2)$
$\qquad \overline{update}(FE, k_1, d_1)[k_2 \mapsto d_2]$    if $(k_1 \neq k_2)$
$\qquad FE[k_2 \mapsto d_2][k_1 \mapsto d_1]$    if $(k_1 \stackrel{?}{=} k_2)$

Figure 4.10 Simplifications performed by $\overline{access}$ and $\overline{update}$. The operations $\equiv$, $\neq$, and $\stackrel{?}{=}$ denote equality-as-terms, definite-disequality, and possible-equality, respectively. (The possible-equality tests, "$k_1 \stackrel{?}{=} k_2$", are really "otherwise" cases of three-pronged comparisons.)

To use semantic reinterpretation to create a $\mathcal{WLP}$ primitive that implements Morris's rule, simplifications are performed by $\overline{access}$ and $\overline{update}$ according to the definitions given in Fig. 4.10. The possible-equality case for $\overline{access}$ in Fig. 4.10 introduces ite terms. As illustrated in Ex. 4.7, it is these ite terms that cause the reinterpreted operations to account for possible aliasing combinations, and thus are the reason that the semantic-reinterpretation method automatically carries out the actions of Morris's rule of substitution [138].

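Example 4.7 We now demonstrate how semantic reinterpretation produces the L[PL] formula for $\mathcal{WLP}({*}p = e,\ x = 5)$ claimed in Ex. 4.5.

$U := \overline{\mathcal{I}}[\![{*}p = e;]\!]U_{id}$
$\quad = \overline{updateStore}(U_{id},\ \overline{\mathcal{E}}[\![p]\!]U_{id},\ \overline{\mathcal{E}}[\![e]\!]U_{id})$
$\quad = \overline{updateStore}(U_{id},\ \overline{lookupState}(U_{id}, p),\ \overline{lookupState}(U_{id}, e))$
$\quad = \overline{updateStore}(U_{id},\ F_\rho(p),\ F_\rho(e))$
$\quad = ((U_{id}{\uparrow}1),\ \{F'_\rho \hookleftarrow F_\rho[F_\rho(p) \mapsto F_\rho(e)]\})$

$\mathcal{WLP}({*}p = e,\ F_\rho(x)\ \boxed{=}\ 5)$
$\quad = \overline{\mathcal{F}}[\![F_\rho(x)\ \boxed{=}\ 5]\!]U$
$\quad = (\overline{\mathcal{T}}[\![F_\rho(x)]\!]U)\ \boxed{=}\ (\overline{\mathcal{T}}[\![5]\!]U)$
$\quad = (\overline{access}(\overline{\mathcal{FE}}[\![F_\rho]\!]U,\ \overline{\mathcal{T}}[\![x]\!]U))\ \boxed{=}\ 5$
$\quad = (\overline{access}(\overline{lookupFuncId}(U, F_\rho),\ \overline{lookupId}(U, x)))\ \boxed{=}\ 5$
$\quad = (\overline{access}(F_\rho[F_\rho(p) \mapsto F_\rho(e)],\ x))\ \boxed{=}\ 5$
$\quad = ite(F_\rho(p)\ \boxed{=}\ x,\ F_\rho(e),\ \overline{access}(F_\rho, x))\ \boxed{=}\ 5$
$\quad = ite(F_\rho(p)\ \boxed{=}\ x,\ F_\rho(e),\ F_\rho(x))\ \boxed{=}\ 5$

Note how the case for $\overline{access}$ that involves a possible-equality comparison causes an ite term to arise that tests "$F_\rho(p)\ \boxed{=}\ x$". The test determines whether the value of p is the address of x, which is the only aliasing condition that matters for this example. □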

Although $\mathcal{WLP}$ is sometimes confused with the formula-manipulation operations used to obtain
a formula that expresses it, or with the formula $\psi$ that results, $\mathcal{WLP}$ is really a semantic
notion: the set of states described by $\psi$. For example, for any statement s: var = rhs; in a language
that only has int-valued variables, and postcondition formula $\varphi$, the formula $\varphi[\mathit{var} \leftarrow \mathit{rhs}]$
obtained by substitution is not the only formula that expresses $\mathcal{WLP}(s, \varphi)$. In fact, there are
infinitely many acceptable formulas. A formula $\psi$ is acceptable if $\psi$ holds in the pre-state structure $\iota$
exactly when $\varphi$ holds in the post-state structure $\mathcal{I}[\![s]\!]\iota$.

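Definition 4.8 (Acceptable WLP Formula) $\psi$ is an acceptable formula for $\mathcal{WLP}(s, \varphi)$ iff, for
all $\iota \in \mathit{LogicalStruct}$,

\[ \mathcal{F}[\![\psi]\!]\iota \iff \mathcal{F}[\![\varphi]\!](\mathcal{I}[\![s]\!]\sigma), \]

where $\sigma$ is the State that corresponds to LogicalStruct $\iota$ (i.e., $\sigma = ((\iota\!\uparrow\!1),\ (\iota\!\uparrow\!2)F_\rho)$; see Appendix B).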

The correctness of the $\mathcal{WLP}$ primitive defined in Eqn. (4.1) is captured by the following theorem:

Theorem 4.9 For any Stmt s and Formula $\varphi$, $\psi \stackrel{\text{def}}{=} \mathcal{F}[\![\varphi]\!](\mathcal{I}[\![s]\!]U_{id})$ is an acceptable $\mathcal{WLP}$ formula
for $\varphi$ with respect to s.
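
\[
\begin{aligned}
\mathcal{U}[\![U_3]\!]U_{1,2}
 &= \mathcal{U}[\![(\emptyset,\ \{F'_\rho \mapsfrom F_\rho[x \mapsto F_\rho(x) \boxed{\oplus} F_\rho(y)][y \mapsto F_\rho(y)]\})]\!]U_{1,2} \\
 &= (\emptyset,\ (U_{1,2}\!\uparrow\!2)[F'_\rho \mapsfrom \mathcal{FE}[\![F_\rho[x \mapsto F_\rho(x) \boxed{\oplus} F_\rho(y)][y \mapsto F_\rho(y)]]\!]U_{1,2}]) \\
 &= (\emptyset,\ \{F'_\rho \mapsfrom \mathcal{FE}[\![F_\rho[x \mapsto F_\rho(x) \boxed{\oplus} F_\rho(y)][y \mapsto F_\rho(y)]]\!]U_{1,2}\}) \\
 &= (\emptyset,\ \{F'_\rho \mapsfrom \mathit{update}(\mathcal{FE}[\![F_\rho[x \mapsto F_\rho(x) \boxed{\oplus} F_\rho(y)]]\!]U_{1,2},\ \mathcal{T}[\![y]\!]U_{1,2},\ \mathcal{T}[\![F_\rho(y)]\!]U_{1,2})\}) \\
 &= (\emptyset,\ \{F'_\rho \mapsfrom (\mathcal{FE}[\![F_\rho[x \mapsto F_\rho(x) \boxed{\oplus} F_\rho(y)]]\!]U_{1,2})[y \mapsto F_\rho(x)]\}) \\
 &= (\emptyset,\ \{F'_\rho \mapsfrom \mathit{update}(\mathcal{FE}[\![F_\rho]\!]U_{1,2},\ \mathcal{T}[\![x]\!]U_{1,2},\ \mathcal{T}[\![F_\rho(x) \boxed{\oplus} F_\rho(y)]\!]U_{1,2})[y \mapsto F_\rho(x)]\}) \\
 &= (\emptyset,\ \{F'_\rho \mapsfrom (F_\rho[x \mapsto \mathcal{T}[\![F_\rho(x) \boxed{\oplus} F_\rho(y)]\!]U_{1,2}][y \mapsto F_\rho(x)])[y \mapsto F_\rho(x)]\}) \\
 &= (\emptyset,\ \{F'_\rho \mapsfrom F_\rho[x \mapsto ((F_\rho(x) \boxed{\oplus} F_\rho(y)) \boxed{\oplus} F_\rho(x))][y \mapsto F_\rho(x)]\}) \\
 &= (\emptyset,\ \{F'_\rho \mapsfrom F_\rho[x \mapsto F_\rho(y)][y \mapsto F_\rho(x)]\}) \\
 &= U_{swap}
\end{aligned}
\]

Figure 4.11 Example of symbolic composition.
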


Proof:  See  App.  B.2.  □ 

Symbolic Composition. The goal of symbolic composition is to have a method that, given two
symbolic representations of state changes, computes a symbolic representation of their composed
state change. In our approach, each state change is represented in logic L[PL] by a StructUpdate,
and the method computes a new StructUpdate that represents their composition. To accomplish
this, L[PL] is used as a reinterpretation domain, exactly as for $\mathcal{WLP}$. Moreover, $\mathcal{U}$ turns out to be
exactly the symbolic-composition function that we seek. In particular, $\mathcal{U}$ works as follows:

\[ \mathcal{U}[\![(\{I_i \mapsfrom T_i\},\ \{F_j \mapsfrom FE_j\})]\!]U = ((U\!\uparrow\!1)[I_i \mapsfrom \mathcal{T}[\![T_i]\!]U],\ (U\!\uparrow\!2)[F_j \mapsfrom \mathcal{FE}[\![FE_j]\!]U]) \]

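Example 4.10 At the syntactic level, we can demonstrate the ability of $\mathcal{U}$ (plus simple algebraic
simplification) to perform symbolic composition by showing that, for the swap-code fragment from
Fig. 4.1(a),

\[ \mathcal{I}[\![s_1; s_2; s_3]\!]U_{id} = \mathcal{U}[\![\mathcal{I}[\![s_3]\!]U_{id}]\!](\mathcal{I}[\![s_1; s_2]\!]U_{id}). \]

First, consider the left-hand side. As shown in Fig. 4.8,
$\mathcal{I}[\![s_1; s_2; s_3]\!]U_{id} = (\emptyset,\ \{F'_\rho \mapsfrom F_\rho[x \mapsto F_\rho(y)][y \mapsto F_\rho(x)]\}) = U_{swap}$.
Now consider the right-hand side. Let $U_{1,2}$ and $U_3$ be defined as follows:

\[
\begin{aligned}
U_{1,2} &= \mathcal{I}[\![s_1; s_2]\!]U_{id} \\
        &= (\emptyset,\ \{F'_\rho \mapsfrom F_\rho[x \mapsto F_\rho(x) \boxed{\oplus} F_\rho(y)][y \mapsto F_\rho(x)]\}) \\
U_3     &= \mathcal{I}[\![s_3]\!]U_{id} \\
        &= (\emptyset,\ \{F'_\rho \mapsfrom F_\rho[x \mapsto F_\rho(x) \boxed{\oplus} F_\rho(y)][y \mapsto F_\rho(y)]\}).
\end{aligned}
\]

As shown in Fig. 4.11,

\[ \mathcal{U}[\![U_3]\!]U_{1,2} = (\emptyset,\ \{F'_\rho \mapsfrom F_\rho[x \mapsto F_\rho(y)][y \mapsto F_\rho(x)]\}). \]

Therefore, $\mathcal{U}[\![U_3]\!]U_{1,2} = U_{swap}$. □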
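
The equality checked in Ex. 4.10 can also be replayed mechanically. The sketch below is a deliberately simplified model, not the L[PL] machinery itself: it assumes that variables never alias, represents a StructUpdate as a dictionary from variable names to symbolic values, and represents a symbolic value as the set of variables whose initial values are XORed together (so that the algebraic simplification a ⊕ a = 0 is built in). All function names are illustrative.

```python
from functools import reduce

# A symbolic value is a frozenset of variable names, read as the XOR of the
# initial values of those variables (x ^ x cancels, so XOR is symmetric difference).
def xor(a, b):
    return a ^ b

def identity(variables):
    """U_id: every variable maps to its own initial value."""
    return {v: frozenset([v]) for v in variables}

def eval_term(term, U):
    """T[[term]]U: rewrite a term over current values into one over initial values."""
    return reduce(xor, (U[v] for v in term), frozenset())

def execute(stmt, U):
    """I[[var = rhs]]U for an assignment whose right-hand side is an XOR of variables."""
    var, rhs = stmt
    return {**U, var: eval_term(frozenset(rhs), U)}

def compose(U2, U1):
    """U[[U2]]U1: the update that applies U1 first and then U2."""
    return {v: eval_term(t, U1) for v, t in U2.items()}

s1 = ("x", ["x", "y"])   # x = x ^ y
s2 = ("y", ["x", "y"])   # y = x ^ y
s3 = ("x", ["x", "y"])   # x = x ^ y

U_id = identity(["x", "y"])
U_12 = execute(s2, execute(s1, U_id))
U_3 = execute(s3, U_id)
U_swap = execute(s3, U_12)

print(compose(U_3, U_12) == U_swap)
```

Here `compose(U_3, U_12)` maps x to the initial value of y and y to the initial value of x, which is the same update obtained by symbolically executing all three statements in sequence.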

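The semantic correctness of the symbolic-composition primitive $\mathcal{U}$ is captured by the following
theorem, which shows that the meaning of $\mathcal{U}[\![U_2]\!]U_1$ is the composition of the meanings of $U_2$ and
$U_1$.

Theorem 4.11 For all $U_1, U_2 \in \mathit{StructUpdate}$,

\[ \mathcal{U}[\![\,\mathcal{U}[\![U_2]\!]U_1\,]\!] = \mathcal{U}[\![U_2]\!] \circ \mathcal{U}[\![U_1]\!] \]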

Proof:  See  App.  B.3.  □ 

4.4  Symbolic  Analysis  for  MC  via  Reinterpretation 

To obtain the three symbolic-analysis primitives for MC, we use a reinterpretation of MC’s
semantics that is essentially identical to the reinterpretation for PL, modulo the fact that the semantics
of PL is written in terms of the combinators lookupEnv, lookupStore, and updateStore, whereas
the semantics of MC is written in terms of $\mathit{lookup}_{reg}$, $\mathit{store}_{reg}$, $\mathit{lookup}_{flag}$, $\mathit{store}_{flag}$, $\mathit{lookup}_{mem}$, and
$\mathit{store}_{mem}$.

Symbolic Evaluation. The base types are redefined as BVal = Formula, Val = Term, State =
StructUpdate, where the vocabulary for LogicalStructs is

\[ (\{zf, eax, ebx, ebp, eip\},\ \{F_{mem}\}). \]

Lookup and store operations for MC, such as $\mathit{lookup}_{mem}$ and $\mathit{store}_{mem}$, are handled the same way
that lookupStore and updateStore are handled for PL.

\[
\begin{aligned}
\mathit{lookup}_{mem} &: \mathit{StructUpdate} \to \mathit{Term} \to \mathit{Term} \\
\mathit{lookup}_{mem} &= \lambda U.\lambda T.\,((U\!\uparrow\!2)F_{mem})(T) \\
\mathit{store}_{mem} &: \mathit{StructUpdate} \to \mathit{Term} \to \mathit{Term} \to \mathit{StructUpdate} \\
\mathit{store}_{mem} &= \lambda U.\lambda T_1.\lambda T_2.\,((U\!\uparrow\!1),\ (U\!\uparrow\!2)[F_{mem} \mapsfrom ((U\!\uparrow\!2)F_{mem})[T_1 \mapsto T_2]]) \\
\mathit{lookup}_{reg} &: \mathit{StructUpdate} \to \mathit{register} \to \mathit{Term} \\
\mathit{lookup}_{reg} &= \lambda U.\lambda r.\,(U\!\uparrow\!1)(r) \\
\mathit{store}_{reg} &: \mathit{StructUpdate} \to \mathit{register} \to \mathit{Term} \to \mathit{StructUpdate} \\
\mathit{store}_{reg} &= \lambda U.\lambda r.\lambda T.\,((U\!\uparrow\!1)[r \mapsfrom T],\ (U\!\uparrow\!2))
\end{aligned}
\]

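Because we placed zf in the set of constant symbols (which denote Int32 values), we use the
following definitions of $\mathit{lookup}_{flag}$ and $\mathit{store}_{flag}$, where in $\mathit{store}_{flag}$ the Int32 values 1 and 0 encode T
and F, respectively.²

\[
\begin{aligned}
\mathit{lookup}_{flag} &: \mathit{StructUpdate} \to \mathit{flagName} \to \mathit{Formula} \\
\mathit{lookup}_{flag} &= \lambda U.\lambda f.\,((U\!\uparrow\!1)(f)) \boxed{=} 1 \\
\mathit{store}_{flag} &: \mathit{StructUpdate} \to \mathit{flagName} \to \mathit{Formula} \to \mathit{StructUpdate} \\
\mathit{store}_{flag} &= \lambda U.\lambda f.\lambda \varphi.\,((U\!\uparrow\!1)[f \mapsfrom \mathit{ite}(\varphi, 1, 0)],\ (U\!\uparrow\!2))
\end{aligned}
\]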
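
To make the shape of these reinterpreted combinators concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the generated C++ code: a StructUpdate is modeled as a pair of dictionaries (constants, functions), terms and formulas are nested tuples, and the two-instruction usage at the end is a made-up fragment.

```python
# A StructUpdate is modeled as a pair (constants, functions) of dicts; terms and
# formulas are nested tuples. These encodings are illustrative only.
U_ID = ({r: r for r in ("zf", "eax", "ebx", "ebp", "eip")}, {"Fmem": "Fmem"})

def lookup_reg(U, r):
    return U[0][r]

def store_reg(U, r, term):
    return ({**U[0], r: term}, U[1])

def lookup_mem(U, addr):
    return ("app", U[1]["Fmem"], addr)              # ((U^2) Fmem)(addr)

def store_mem(U, addr, term):
    fe = ("upd", U[1]["Fmem"], addr, term)          # ((U^2) Fmem)[addr |-> term]
    return (U[0], {**U[1], "Fmem": fe})

def lookup_flag(U, f):
    return ("eq", U[0][f], 1)                       # the flag is set iff its value is 1

def store_flag(U, f, phi):
    return ({**U[0], f: ("ite", phi, 1, 0)}, U[1])  # encode T/F as 1/0

# Hypothetical fragment: load eax from [ebp - 10], then set zf iff eax is 0
U1 = store_reg(U_ID, "eax", lookup_mem(U_ID, ("sub", "ebp", 10)))
U2 = store_flag(U1, "zf", ("eq", lookup_reg(U1, "eax"), 0))
print(lookup_flag(U2, "zf"))
```

Each store returns a new pair rather than mutating its argument, which mirrors the functional style of the definitions above.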

Example 4.12 Fig. 4.1(d) shows the MC code that corresponds to the swap code in Fig. 4.1(a):
lines 1-3, lines 4-6, and lines 7-9 correspond to lines 1, 2, and 3 of Fig. 4.1(a), respectively.
For the MC code in Fig. 4.1(d), $\mathcal{I}_{MC}[\![\mathit{swap}]\!]U_{id}$, which denotes the symbolic evaluation of swap,
produces the StructUpdate

\[
\left(
\{eax' \mapsfrom F_{mem}(ebp \boxed{-} 14)\},\
\{F'_{mem} \mapsfrom F_{mem}[ebp \boxed{-} 10 \mapsto F_{mem}(ebp \boxed{-} 14)][ebp \boxed{-} 14 \mapsto F_{mem}(ebp \boxed{-} 10)]\}
\right)
\]

Fig. 4.1(d) illustrates why it is essential to be able to handle address arithmetic: an access on
a source-level variable is compiled into machine code that dereferences an address in the stack
frame computed from the frame pointer (ebp) and an offset. This example shows that $\mathcal{I}_{MC}$ is able
to handle address arithmetic correctly. □


WLP. To create a formula for the $\mathcal{WLP}$ of $\varphi$ with respect to instruction i via semantic
reinterpretation, we use the reinterpreted MC semantics $\mathcal{I}_{MC}$, together with the reinterpreted L[MC]
meaning function $\mathcal{F}_{MC}$, where $\mathcal{F}_{MC}$ is created via the same approach used in §4.3 to reinterpret
L[PL]. $\mathcal{WLP}(i, \varphi)$ is obtained by performing $\mathcal{F}_{MC}[\![\varphi]\!](\mathcal{I}_{MC}[\![i]\!]U_{id})$.

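²To simplify the exposition, L is intentionally a limited logic over values of type Int32. To define $\mathit{lookup}_{flag}$ and
$\mathit{store}_{flag}$, it would be more convenient to use a logic with Boolean-valued constant symbols $B_j \in \mathit{BoolId}$, in which
case a StructUpdate would be a triple of the form

\[ (\{I_i \mapsfrom T_i\},\ \{B_j \mapsfrom \varphi_j\},\ \{F_k \mapsfrom FE_k\}), \]

and $\mathit{lookup}_{flag}$ and $\mathit{store}_{flag}$ could be defined as follows:

\[
\begin{aligned}
\mathit{lookup}_{flag} &= \lambda U.\lambda f.\,(U\!\uparrow\!2)(f) \\
\mathit{store}_{flag} &= \lambda U.\lambda f.\lambda \varphi.\,((U\!\uparrow\!1),\ (U\!\uparrow\!2)[f \mapsfrom \varphi],\ (U\!\uparrow\!3))
\end{aligned}
\]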

[1] void foo(int e, int x, int* p)
[2] {
[3]   *p = e;
[4]   if (x == 5)
[5]     goto ERROR;
[6] }

(a)

[1] mov eax, p;
[2] mov ebx, e;
[3] mov [eax], ebx;
[4] cmp x, 5;
[5] jz ERROR;
[6] ...
[7] ERROR: ...

(b)

Figure 4.12 (a) A simple source-code fragment written in PL; (b) the MC code for (a).


Example 4.13 Fig. 4.12(a) shows a source-code fragment; Fig. 4.12(b) shows the corresponding
MC code. (To simplify the MC code, source-level variable names are used.) In Fig. 4.12(a), the
largest set of states just before line [3] that cause the branch to ERROR to be taken at line [4] is
described by $\mathcal{WLP}(*p = e,\ x = 5)$. In Fig. 4.12(b), an expression that characterizes whether
the branch to ERROR is taken is $\mathcal{WLP}(s_{[1]\text{-}[5]},\ (eip \boxed{=} c_{[7]}))$, where $s_{[1]\text{-}[5]}$ denotes instructions
[1]-[5] of Fig. 4.12(b), and $c_{[7]}$ is the address of ERROR. Using semantic reinterpretation,

\[ \mathcal{F}_{MC}[\![(eip \boxed{=} c_{[7]})]\!](\mathcal{I}_{MC}[\![s_{[1]\text{-}[5]}]\!]U_{id}) \]

produces the formula

\[ (\mathit{ite}((F_{mem}(p) \boxed{=} x),\ F_{mem}(e),\ F_{mem}(x)) \boxed{-} 5) \boxed{=} 0, \]

which, transliterated to informal source-level notation, is (((p = &x) ? e : x) − 5) = 0.

Even though the (source-level) branch is split across two instructions in Fig. 4.12(b), $\mathcal{WLP}$
can be used to recover the branch condition. First,

\[ \mathcal{WLP}(\texttt{cmp x,5; jz ERROR},\ (eip \boxed{=} c_{[7]})) \]

returns the formula

\[ \mathit{ite}(((F_{mem}(x) \boxed{-} 5) \boxed{=} 0),\ c_{[7]},\ c_{[6]}) \boxed{=} c_{[7]}, \]

as shown by the following derivation:

\[
\begin{aligned}
\mathcal{I}_{MC}[\![\texttt{cmp x,5}]\!]U_{id}
  &= (\{zf' \mapsfrom \mathit{ite}((F_{mem}(x) \boxed{-} 5) \boxed{=} 0,\ 1,\ 0)\},\ \emptyset) \\
  &= U_1 \\
\mathcal{I}_{MC}[\![\texttt{jz ERROR}]\!]U_1
  &= \left(\left\{
       \begin{array}{l}
       zf' \mapsfrom \mathit{ite}((F_{mem}(x) \boxed{-} 5) \boxed{=} 0,\ 1,\ 0), \\
       eip' \mapsfrom \mathit{ite}(((F_{mem}(x) \boxed{-} 5) \boxed{=} 0),\ c_{[7]},\ c_{[6]})
       \end{array}
     \right\},\ \emptyset\right) \\
  &= U_2 \\
\mathcal{F}_{MC}[\![eip \boxed{=} c_{[7]}]\!]U_2
  &= \mathit{ite}(((F_{mem}(x) \boxed{-} 5) \boxed{=} 0),\ c_{[7]},\ c_{[6]}) \boxed{=} c_{[7]}
\end{aligned}
\]

Second, because $c_{[7]} \neq c_{[6]}$, the formula in the last line simplifies to $(F_{mem}(x) \boxed{-} 5) \boxed{=} 0$; i.e., in
source-level terms, (x − 5) = 0. □
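
The final simplification step in Ex. 4.13 is a small rewrite on ite terms. The sketch below shows one way such a rule could be written. It is not taken from the implementation: the tuple encoding, the function name `simplify_eq`, and the concrete addresses are all assumptions made for illustration.

```python
# Formulas are nested tuples; addresses are plain integers. Illustrative encoding.
def ite(c, a, b):
    return ("ite", c, a, b)

def simplify_eq(lhs, rhs):
    """Simplify lhs == rhs when lhs is an ite over two distinct constant addresses."""
    if isinstance(lhs, tuple) and lhs[0] == "ite":
        _, cond, a, b = lhs
        if all(isinstance(v, int) for v in (a, b, rhs)) and a != b:
            if rhs == a:
                return cond                 # equality holds exactly when cond holds
            if rhs == b:
                return ("not", cond)        # equality holds exactly when cond fails
    return ("eq", lhs, rhs)                 # no simplification applies

C6, C7 = 0x1006, 0x1007                     # hypothetical addresses of [6] and [7]
zf_cond = ("eq", ("sub", ("app", "Fmem", "x"), 5), 0)   # condition set by cmp x, 5
eip_after_jz = ite(zf_cond, C7, C6)         # jz ERROR: taken iff zf is set

print(simplify_eq(eip_after_jz, C7))
```

Asking instead whether eip equals the fall-through address C6 yields the negated condition, which is the form needed when a test generator wants to flip a branch.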


Symbolic Composition. For MC, symbolic composition can be performed using $\mathcal{U}_{MC}$.

4.5  Other  Language  Constructs 

Branching. Ex. 4.13 illustrated a $\mathcal{WLP}$ computation across a machine-code branch instruction.
We now illustrate forward symbolic evaluation across a branch.

Example 4.14 Suppose that an if-statement is represented by

IfStmt(BE, Int32, Int32),

where BE is the condition and the two Int32s are the addresses of the true-branch and false-branch,
respectively. Its factored semantics would specify how the value of the program counter
PC changes:

\[ \mathcal{I}[\![\mathit{IfStmt}(BE, c_T, c_F)]\!]\sigma = \mathit{updateStore}\ \sigma\ \mathit{PC}\ \mathit{cond}(\mathcal{B}[\![BE]\!]\sigma,\ \mathit{const}(c_T),\ \mathit{const}(c_F)). \]

In the reinterpretation for symbolic evaluation, the StructUpdate U obtained by
$\mathcal{I}[\![\mathit{IfStmt}(BE, c_T, c_F)]\!]U_{id}$ would be $(\{PC' \mapsfrom \mathit{ite}(\varphi_{BE}, c_T, c_F)\},\ \emptyset)$, where $\varphi_{BE}$ is the Formula
obtained for BE under the reinterpreted semantics. To obtain the branch condition for a specific
branch, say the true-branch, we evaluate $\mathcal{F}[\![PC \boxed{=} c_T]\!]U$. The result is $(\mathit{ite}(\varphi_{BE}, c_T, c_F) \boxed{=} c_T)$,
which (assuming that $c_T \neq c_F$) simplifies to $\varphi_{BE}$. (A similar formula simplification was performed
in Ex. 4.13 on the result of the $\mathcal{WLP}$ formula.) □

Formula ObtainPathConstraintFormula(Path π) {
  Formula φ = T;                  // Initial path-constraint formula
  StructUpdate U = U_id;          // Initial symbolic state-transformer
  let [PC_1 : i_1, PC_2 : i_2, ..., PC_n : i_n, PC_{n+1} : skip] = π in
  for (k = 1; k ≤ n; k++) {
    U = I[[i_k]]U;                      // Symbolically execute i_k
    φ = φ && F[[PC [=] PC_{k+1}]]U;     // Conjoin the branch condition for i_k
  }
  return φ;
}

Figure 4.13 An algorithm to obtain a path-constraint formula that characterizes which initial
states must follow path π.

Figure 4.14 Conversion of a recursively defined instruction, portrayed in (a) as a “microcode
loop” over the actions denoted by the dashed circles and arrows, into (b), an explicit loop in the
control-flow graph whose body is an instruction defined without using recursion. The three
microcode operations in (b) correspond to the three operations in the body of the microcode loop
in (a).

Loops. One kind of intended client of our approach to creating symbolic-analysis primitives is
hybrid concrete/symbolic state-space exploration [94, 167, 95, 54]. Such tools use a combination
of concrete and symbolic evaluation to generate inputs that increase coverage. In such tools, a
program-level loop is executed concretely a specific number of times as some path $\pi$ is followed.
The symbolic-evaluation primitive for a single instruction is applied to each instruction of $\pi$ to
obtain symbolic states at each point of $\pi$. A path-constraint formula that characterizes which
initial states must follow $\pi$ can be obtained by collecting the branch formula $\varphi_{BE}$ obtained at each
branch condition by the technique described above; the algorithm is shown in Fig. 4.13.
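
As a rough illustration of path-constraint collection, the following Python sketch walks a concrete path through a toy program and gathers one conjunct per branch. The instruction encoding, the helper names, and the example path are assumptions invented for this sketch. Unlike Fig. 4.13, it reads the branch condition directly instead of evaluating PC [=] PC_{k+1}, and it omits the trivially true conjuncts contributed by non-branch instructions.

```python
# Toy instruction set (illustrative): ("assign", var, expr) or
# ("branch", cond, pc_true, pc_false). Expressions are nested tuples over
# variable names and integers; the symbolic state maps each variable to an
# expression over the initial values.
def subst(expr, U):
    """Reinterpret an expression in symbolic state U."""
    if isinstance(expr, str):
        return U.get(expr, expr)
    if isinstance(expr, tuple):
        return (expr[0],) + tuple(subst(a, U) for a in expr[1:])
    return expr

def path_constraint(path):
    """path is [(pc_1, i_1), ..., (pc_n, i_n), (pc_{n+1}, None)]."""
    phi = []                       # conjunction, kept as a list of conjuncts
    U = {}                         # identity symbolic state
    for (pc, instr), (next_pc, _) in zip(path, path[1:]):
        if instr[0] == "assign":
            _, var, expr = instr
            U = {**U, var: subst(expr, U)}
        else:
            _, cond, pc_true, pc_false = instr
            c = subst(cond, U)
            phi.append(c if next_pc == pc_true else ("not", c))
    return phi

path = [
    (0, ("assign", "y", ("add", "x", 1))),
    (1, ("branch", ("lt", "y", 10), 2, 5)),     # taken: next pc is 2
    (2, ("assign", "y", ("mul", "y", 2))),
    (3, ("branch", ("eq", "y", 8), 7, 4)),      # not taken: next pc is 4
    (4, None),
]
constraint = path_constraint(path)
print(constraint)
```

Every conjunct is expressed over the initial value of x, so a satisfying assignment for the conjunction is an input that follows the same path.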

X86  String  Instructions.  X86  string  instructions  can  involve  actions  that  perform  an  a  priori 
unbounded  amount  of  work  (e.g.,  the  amount  performed  is  determined  by  the  value  held  in  register 
ecx  at  the  start  of  the  instruction).  This  can  be  reduced  to  the  loop  case  discussed  above  by  giving 
a  semantics  in  which  the  instruction  itself  is  one  of  its  two  successors.  In  essence,  the  “microcode 
loop”  is  converted  into  an  explicit  loop  (see  Fig.  4.14). 

Procedures.  A  call  statement’s  semantics  (i.e.,  how  the  state  is  changed  by  the  call  action)  would 
be  specified  with  some  collection  of  operations.  Again,  the  reinterpretation  of  the  state  transformer 
is  induced  by  the  reinterpretation  of  each  operation: 

•  For  a  call  statement  in  a  high-level  language,  there  would  be  an  operation  that  creates  a 
new  activation  record.  The  reinterpretation  of  this  would  generate  a  fresh  logical  constant  to 
represent  the  location  of  the  new  activation  record. 

•  For  a  call  instruction  in  a  machine-code  language,  register  operations  would  change  the 
stack  pointer  and  frame  pointer,  and  memory  operations  would  initialize  fields  of  the  new 
activation  record.  These  are  reinterpreted  in  exactly  the  same  way  that  register  and  memory 
operations  are  reinterpreted  for  other  constructs. 


Dynamic Allocation. Two approaches are possible:

•  The allocation package is implemented as a library. One can apply our techniques to the
machine code from the library.

•  If  a  formula  is  desired  that  is  based  on  a  high-level  semantics,  a  call  statement  that  calls 
malloc  or  new  can  be  reinterpreted  using  the  kind  of  approach  used  in  other  systems  (a  fresh 
logical  constant  denoting  a  new  location  can  be  generated). 

4.6  Incorporating  Non-Determinism 

Many  formalisms  for  symbolic  analysis  of  programs  support  the  use  of  non-determinism, 
which  is  useful  for  writing  “harness  code”  (code  that  models  the  possible  client  environments  from 
which  the  code  being  analyzed  might  be  called),  as  well  as  for  modeling  the  possible  inputs  to  a 
program.  A  common  approach  is  to  provide  a  primitive  that  returns  an  arbitrary  value  of  a  given 
type. Examples include the SdvMakeChoice primitive of SLAM [46] and the havoc(x) primitive
of BoogiePL [48]. In this section, we discuss adding such a primitive, CALL randInt32, to MC.
CALL randInt32 is an instruction that assigns an arbitrary value to register eax.³ We refer to MC
extended with CALL randInt32 as NDMC.

This  section  describes  how  implementations  of  the  basic  primitives  used  in  symbolic  program 
analysis  are  obtained  for  NDMC.  (Essentially  the  same  method  can  be  applied  to  a  version  of  PL 
extended  with  its  own  primitive  for  generating  an  arbitrary  Int32  value.) 

Because our approach to creating implementations of the primitives used in symbolic program
analysis is based on semantic reinterpretation, our goal is to give a concrete semantics for
CALL randInt32 whose reinterpretation produces the desired effect. At an intuitive level, we would
like to treat each invocation of CALL randInt32 as reading the next input value, and have the
semantics of the program arrange to record all of the input values. To carry out something equivalent
to this, we assume that the meta-language in which semantic specifications are written supports a
primitive for creating a random map, which is a map initialized with arbitrary values.⁴ Rather than
recording input values, we will materialize, in a random map that is part of the input state, the
sequence of non-deterministic values that eax will receive on successive calls to CALL randInt32.
The state will also contain an index-variable, which indicates the index of the next choice. Thus,
all non-determinism in the concrete semantics is pushed onto the initialization of the random map
in the initial state; all transitions thereafter are deterministic.

³ In the x86 instruction set, register eax is used to pass back the return value from a function call.

⁴ A random map is easy to model in logic L using a function that is unconstrained.

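The CALL randInt32 instruction and its semantics are defined as an extension of the MC language
presented in §4.2.3:

instruction ::= ... | CALL randInt32

An NDMC state is defined in terms of

\[
\begin{aligned}
\mathit{choiceMap} &\in \mathit{Val} \to \mathit{Val} \\
\mathit{choiceIndex} &\in \mathit{Val}
\end{aligned}
\]

and an NDMC state $\sigma \in \mathit{State}$ is now a quintuple

(mem, reg, flag, choiceMap, choiceIndex),

where choiceMap is a random map.

\[
\begin{aligned}
\mathit{lookup}_{choiceMap} &: \mathit{State} \to \mathit{Val} \\
\mathit{lookup}_{choiceMap} &= \lambda(\mathit{mem}, \mathit{reg}, \mathit{flag}, \mathit{choiceMap}, \mathit{choiceIndex}).\,\mathit{choiceMap}(\mathit{choiceIndex}) \\
\mathit{incr}_{choiceIndex} &: \mathit{State} \to \mathit{State} \\
\mathit{incr}_{choiceIndex} &= \lambda(\mathit{mem}, \mathit{reg}, \mathit{flag}, \mathit{choiceMap}, \mathit{choiceIndex}).\,(\mathit{mem}, \mathit{reg}, \mathit{flag}, \mathit{choiceMap}, \mathit{choiceIndex} + 1)
\end{aligned}
\]

The concrete semantics of CALL randInt32 is defined as follows:

\[
\mathcal{I}[\![\text{CALL randInt32}]\!]\sigma
  = \mathit{incr}_{eip}(\mathit{store}_{reg}(\mathit{incr}_{choiceIndex}(\sigma),\ eax,\ \mathit{lookup}_{choiceMap}(\sigma)))
\]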
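
The concrete semantics above can be sketched directly in Python. This is a minimal illustration under several assumptions that are not in the text: memory and flags are omitted from the state, the random map is stood in for by a fixed list, and `incr_eip` simply adds 1 because instruction lengths are not modeled.

```python
from typing import Callable, NamedTuple

class State(NamedTuple):
    regs: dict                          # register name -> value (mem and flags omitted)
    choice_map: Callable[[int], int]
    choice_index: int

def lookup_choice_map(s: State) -> int:
    return s.choice_map(s.choice_index)

def incr_choice_index(s: State) -> State:
    return s._replace(choice_index=s.choice_index + 1)

def store_reg(s: State, r: str, v: int) -> State:
    return s._replace(regs={**s.regs, r: v})

def incr_eip(s: State) -> State:
    return store_reg(s, "eip", s.regs["eip"] + 1)   # unit-length instructions assumed

def call_rand_int32(s: State) -> State:
    return incr_eip(store_reg(incr_choice_index(s), "eax", lookup_choice_map(s)))

choices = [17, 4, 99]                   # stands in for a random map
s0 = State({"eax": 0, "eip": 0}, lambda i: choices[i], 0)
s1 = call_rand_int32(s0)
s2 = call_rand_int32(s1)
print(s1.regs["eax"], s2.regs["eax"], s2.choice_index)
```

Because the choices live in the initial state, running `call_rand_int32` again from `s0` reproduces exactly the same successor state.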
Reinterpretation in Logic. As before, State is reinterpreted as a StructUpdate: State =
StructUpdate, where the vocabulary for LogicalStructs is

\[ (\{\mathit{choiceIndex}, zf, eax, ebx, ebp, eip\},\ \{F_{choiceMap}, F_{mem}\}) \]

and $U_{id}$ is

\[
(\{\mathit{choiceIndex}' \mapsfrom \mathit{choiceIndex},\ zf' \mapsfrom zf,\ \ldots\},\
 \{F'_{choiceMap} \mapsfrom F_{choiceMap},\ F'_{mem} \mapsfrom F_{mem}\}).
\]


WLP in the Presence of Non-Determinism. In previous sections, we have referred to the
backwards-reasoning primitive generated by our method as $\mathcal{WLP}$, which is correct for the
situation considered in §4.3 and §4.4, namely languages whose primitive statements/instructions are
total and deterministic.

In the terminology of relational semantics [171], one considers two backwards-reasoning
primitives, $\mathit{pre}$ and $\widetilde{\mathit{pre}}$, defined as follows (where R is a binary relation on Q, and $\varphi$ defines a subset
of Q):

\[
\begin{aligned}
\mathit{pre}[R](\varphi) &= \exists q'.\ (R(q, q') \land \varphi(q')) \\
\widetilde{\mathit{pre}}[R](\varphi) &= \forall q'.\ (R(q, q') \Rightarrow \varphi(q'))
\end{aligned}
\]

$\mathit{pre}$ specifies the set of all predecessors in R of states that satisfy $\varphi$. $\widetilde{\mathit{pre}}$ specifies the largest set of
states such that for each state q all successors of q (possibly the empty set) satisfy $\varphi$.

The backwards-reasoning primitive considered in §4.3 and §4.4 could be referred to as either
$\mathit{pre}$ or $\widetilde{\mathit{pre}}$, because the two operators are identical for total, deterministic transitions. For a
non-deterministic transition system, however, $\mathit{pre}$ and $\widetilde{\mathit{pre}}$ are different. For instance, execution of the
havoc(x) primitive of BoogiePL [48] assigns an arbitrary value to x. For havoc(x), $\mathit{pre}$ and $\widetilde{\mathit{pre}}$
are defined as follows:

\[
\begin{aligned}
\mathit{pre}[\text{havoc}(x)](\varphi) &= \exists x.\ \varphi \\
\widetilde{\mathit{pre}}[\text{havoc}(x)](\varphi) &= \forall x.\ \varphi
\end{aligned}
\]
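
The difference between the two operators is easy to see on a small finite transition relation. The following Python sketch is purely illustrative: the state space, the relations, and the predicate are made up, and the existential and universal definitions are computed by brute force.

```python
def pre(R, Q, phi):
    """States with at least one R-successor that satisfies phi."""
    return {q for q in Q if any((q, q2) in R and phi(q2) for q2 in Q)}

def pre_tilde(R, Q, phi):
    """States all of whose R-successors (possibly none) satisfy phi."""
    return {q for q in Q if all(phi(q2) for q2 in Q if (q, q2) in R)}

Q = {0, 1, 2, 3}
R_det = {(q, (q + 1) % 4) for q in Q}        # total and deterministic
R_nondet = {(0, 1), (0, 2), (1, 2)}          # 0 has two successors; 2 and 3 have none
is_even = lambda q: q % 2 == 0

print(pre(R_det, Q, is_even) == pre_tilde(R_det, Q, is_even))
print(pre(R_nondet, Q, is_even), pre_tilde(R_nondet, Q, is_even))
```

For the total, deterministic relation the two sets coincide. For the non-deterministic one, state 0 is in the existential set only (one of its two successors is even), while the stuck states 2 and 3 are in the universal set only.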

The following example shows that the backwards-reasoning primitive created by our technique
behaves similarly to $\mathit{pre}$.
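Example 4.15 Consider what the backwards-reasoning primitive creates for $eax \boxed{=} 5$ with respect
to CALL randInt32:

\[
\begin{aligned}
\mathcal{I}[\![\text{CALL randInt32}]\!]U_{id}
 &= \left(\left\{
      \begin{array}{l}
      \mathit{choiceIndex}' \mapsfrom (U_{id}\!\uparrow\!1)(\mathit{choiceIndex}) \boxed{+} 1, \\
      eax' \mapsfrom ((U_{id}\!\uparrow\!2)(F_{choiceMap}))((U_{id}\!\uparrow\!1)(\mathit{choiceIndex}))
      \end{array}
    \right\},\ (U_{id}\!\uparrow\!2)\right) \\
 &= \left(\left\{
      \begin{array}{l}
      \mathit{choiceIndex}' \mapsfrom \mathit{choiceIndex} \boxed{+} 1, \\
      eax' \mapsfrom F_{choiceMap}(\mathit{choiceIndex})
      \end{array}
    \right\},\ \emptyset\right) \\
 &= U_1 \\[4pt]
\mathcal{WLP}(\text{CALL randInt32},\ eax \boxed{=} 5)
 &= \mathcal{F}[\![eax \boxed{=} 5]\!]U_1 \\
 &= F_{choiceMap}(\mathit{choiceIndex}) \boxed{=} 5
\end{aligned}
\]

□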


$F_{choiceMap}$ can be thought of as an array of logical variables. In the quantifier-free logic we work
with, formulas are implicitly existentially quantified. Letting v denote $F_{choiceMap}(\mathit{choiceIndex})$, the
formula $F_{choiceMap}(\mathit{choiceIndex}) \boxed{=} 5$ can be thought of as the quantifier-free version of the formula
$\exists v.\ v \boxed{=} 5$, which corresponds to $\mathit{pre}[\text{havoc}(v)](v \boxed{=} 5)$.

Thus, in earlier sections it would have been more precise to have referred to the backwards-reasoning
primitive as $\mathit{pre}$, rather than $\mathcal{WLP}$, although the term $\mathcal{WLP}$ was also correct because
earlier sections dealt only with languages whose primitive statements/instructions are total and
deterministic.

Guaranteed Replay in the Presence of Non-Determinism. The application of directed test
generation [54, 94, 95, 167] requires path constraints that enable the test-generation system to
create new test inputs that are guaranteed to follow a particular path through the program.⁵ In
particular, during forward symbolic evaluation, we want path-constraint generation (Fig. 4.13) to
produce a formula such that when a theorem prover is able to provide an assignment that satisfies
the formula, the satisfying assignment serves as an initial state that will cause concrete execution
of the program to follow a specific path. The paths of interest are ones that replay at least part of a
previous execution trace.

⁵See §4.7 and §4.8 for more detailed discussion of systems for directed test generation.

Figure 4.15 In a symbolic evaluation of the trace from Start to P, the three path constraints
obtained from the branch instructions at $B_0$, $B_1$, and $B_2$ constrain the values of $F_{choiceMap}(0)$,
$F_{choiceMap}(1)$, and $F_{choiceMap}(2)$, respectively. To create a new initial state that causes a concrete
execution of the program to follow the same path, except to branch the opposite way at $B_2$ (to
reach Q), we need the satisfying assignment returned by the theorem prover to satisfy the
constraints on $F_{choiceMap}(0)$ and $F_{choiceMap}(1)$ and the negated constraint on $F_{choiceMap}(2)$.

The situation is illustrated in Fig. 4.15. During directed test generation, suppose that a concrete
execution trace T follows the path from Start to P. Associated with T are three path constraints
obtained from the branch instructions at $B_0$, $B_1$, and $B_2$. The three constraints constrain the
values of $F_{choiceMap}(0)$, $F_{choiceMap}(1)$, and $F_{choiceMap}(2)$, respectively. To increase branch coverage, a
directed-test-generation tool would like to obtain an initial state that drives the program along the
same path, except when it reaches $B_2$, when the program should proceed to Q.

With the scheme presented in this section, the theorem prover is able to create such an initial
state by providing initial values for the first three entries of $F_{choiceMap}$ (which models the random
map choiceMap).

Repeatability comes from the fact that we have kept the concrete semantics deterministic by,
in essence, recording all non-deterministically chosen values in a kind of shadow input stream.
As a result, repeatability is automatically obtained for both symbolic evaluation and $\mathcal{WLP}$.
In each case, for a given path we obtain an assignment for the input that forces execution along
that path: in symbolic evaluation, one works forwards and collects path constraints; in $\mathcal{WLP}$, one
works backwards starting from T; the solver is constrained to return an assignment that, at each
branch instruction, causes a concrete execution to branch in the direction that stays on the path.

            TSL Specifications                          Generated C++ Templates
            I[·]     F[·] ∪ T[·] ∪ FE[·] ∪ U[·]         I[·]      F[·] ∪ T[·] ∪ FE[·] ∪ U[·]
x86         3,524    1,510                              23,109    15,632
PowerPC     1,546    (already written)                  12,153    15,632

Figure 4.16 The number of (non-blank) lines of C++ that are generated from the TSL
specifications of the x86 and PowerPC instruction sets (as of Apr. 2010). The number of
(non-blank) lines of TSL are indicated in bold.

4.7  Implementation  and  Evaluation 

We used TSL to (1) define the syntax of L[·] as a user-defined datatype; (2) create a reinterpretation
based on L[·] formulas; (3) define the semantics of L[·] by writing functions that correspond
to $\mathcal{F}$, $\mathcal{T}$, etc.; and (4) apply reinterpretation (2) to the meaning functions of L[·] itself. (We already
had in hand TSL specifications of x86 and PowerPC.)

When semantic reinterpretation is performed in the manner supported by TSL, it is independent
of any given subject language. Consequently, now that we have carried out steps (1)-(4), all three
symbolic-analysis primitives can be generated automatically for a new instruction set IS merely by
writing a TSL specification of IS, and then applying the TSL compiler. In essence, TSL acts as a
“YACC-like” tool for generating symbolic-analysis primitives from a semantic description of an
instruction set.

To  illustrate  the  leverage  gained  by  using  the  approach  presented  in  this  chapter,  the  table 
shown  in  Fig.  4.16  lists  the  number  of  (non-blank)  lines  of  C++  that  are  generated  from  the  TSL 
specifications  of  the  x86  and  PowerPC  instruction  sets.  The  number  of  (non-blank)  lines  of  TSL 
are  indicated  in  bold. 
In addition to the components for concrete and symbolic evaluation, one also obtains an
implementation of $\mathcal{WLP}$, via the method described in §4.3, by calling the C++ implementations of
$\mathcal{F}[\![\cdot]\!]$ and $\mathcal{I}[\![\cdot]\!]$: $\mathcal{WLP}(s, \varphi) = \mathcal{F}[\![\varphi]\!](\mathcal{I}[\![s]\!]U_{id})$. By Thm. 4.9 (proved in Appendix B), $\mathcal{WLP}$ is guaranteed
to be consistent with the components for concrete and symbolic evaluation (modulo bugs in the
implementation of TSL).

Evaluation.  Some  tools  that  use  symbolic  reasoning  employ  formula  transformations  that  are  not 
faithful  to  the  actual  semantics.  For  instance,  the  SAGE  system  for  directed  test  generation  [95] 
uses  an  approximate  x86  symbolic  evaluation  in  which  concrete  values  are  used  when  non-linear 
operators  or  symbolic  pointer  dereferences  are  encountered.  As  a  result,  its  symbolic  evaluation 
of a path can produce an “unfaithful” path-constraint formula $\varphi$; that is, an actual execution path
may not match the program path predicted by the path-constraint formula $\varphi$. This situation is
called  a  divergence  [95].  Because  the  intended  use  of  SAGE  is  to  generate  inputs  that  increase 
coverage,  it  can  be  acceptable  for  the  tool  to  have  a  substantial  divergence  rate  (due  to  the  use  of 
unfaithful  symbolic  techniques)  if  the  cost  of  performing  symbolic  operations  is  lowered  in  most 
circumstances. 

In contrast with directed test generation, to model check machine code [120, 174]⁶ an
implementation of a faithful symbolic technique is required. A faithful symbolic technique could raise
the cost of performing symbolic operations because faithful path-constraint formulas could be a
great deal more complex than unfaithful ones. Thus, our experiment was designed to answer the
question

“What is the cost of using exact symbolic-evaluation primitives instead of unfaithful
ones?”

It would have been an error-prone task to implement a faithful symbolic-evaluation primitive for
x86 machine code manually. Using TSL, however, we were able to generate a faithful
symbolic-evaluation primitive from an existing, well-tested TSL specification of the semantics of x86
instructions. We also generated an unfaithful symbolic-evaluation primitive that adopts SAGE’s
approximate approach. We used these to create two directed-test-generation tools that perform
state-space exploration: one that uses the faithful primitive, and one that uses the unfaithful
primitive.

⁶The model-checking tool for machine code is described in §5.1.

Although  the  presentation  in  earlier  sections  was  couched  in  terms  of  simplified  core  languages, 
the  implemented  tools  work  with  real  x86  programs.  Our  experiment  used  seven  C++  programs, 
each  exercising  a  single  algorithm  from  the  C++  STL,  compiled  under  Visual  Studio  2005. 

To  compare  the  two  tools’  divergence  rates  and  running  times,  we  used  the  algorithm  shown 
in  Fig.  4.17.  All  execution  runs  were  performed  on  a  single  core  of  a  quad-core  3.0GHz  Pentium 
Xeon  processor  running  Windows  XP,  configured  so  that  a  user  process  has  4  GB  of  memory. 
Tab. 4.1 shows the divergence rates and running times that we measured.

Tab.  4.1  reports  the  number  of  tests  executed,  the  average  length  of  the  trace  obtained  from 
the  tests,  and  the  average  number  of  branches  in  the  traces.  For  the  faithful  version,  we  report  the 
average  time  taken  for  concrete  execution  (CE)  and  symbolic  evaluation  (SE).  In  the  approximate 
(“unfaithful”)  version,  concrete  execution  and  symbolic  evaluation  were  done  in  lock  step  and  their 
total  time  is  reported  in  (CE+SE).  (All  times  are  in  seconds.)  For  each  version,  we  also  report  the 
average time taken by the SMT solver (Yices [82]), the average number of constraints found ($|\varphi|$),
and  the  divergence  rate.  For  the  approximate  version,  we  also  show  the  average  distance  (as  a 
percentage  of  the  total  length  of  the  trace)  before  a  diverging  test  diverged.  TF/TA  denotes  the 
ratio  of  the  times  (CE+SE+SMT)  for  the  faithful  version  and  the  approximate  version. 

On average, the unfaithful primitive had a 57% divergence rate (computed as the arithmetic
mean of the seven measured divergence rates), whereas no divergences were reported for the
faithful primitive. The faithful primitive had 9.27 times more constraints in φ than the unfaithful
primitive (computed as the geometric mean of the ratios of the two versions for the seven programs),
and was about 1.07 times slower than the unfaithful version (geometric mean).
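
The three summary figures can be recomputed from the per-program entries of Tab. 4.1. The short Python sketch below does so; the variable and function names are ours, and the lists are transcribed from the table in the order copy, equal, find, partition, random_shuffle, search, transform.

```python
import math

# Per-program values transcribed from Tab. 4.1.
div_approx = [50, 60, 50, 73, 48, 55, 64]             # divergence rate (%), approximate version
constraints_faithful = [6, 54, 144, 43, 37, 59, 85]   # |phi|, faithful version
constraints_approx = [1, 24, 85, 1, 1, 31, 1]         # |phi|, approximate version
slowdown = [1.05, 1.11, 1.07, 1.16, 1.03, 1.07, 1.00] # T_F/T_A

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # exp of the mean of the logs; all inputs here are strictly positive
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

mean_divergence = arithmetic_mean(div_approx)
constraint_ratio = geometric_mean(
    [f / a for f, a in zip(constraints_faithful, constraints_approx)])
mean_slowdown = geometric_mean(slowdown)

print(f"{mean_divergence:.0f}% {constraint_ratio:.2f} {mean_slowdown:.2f}")
```

The geometric mean is the natural choice for the two ratio columns because it treats a factor-of-k increase and a factor-of-k decrease symmetrically, whereas the divergence rates are plain percentages and are averaged arithmetically.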

4.8  Related  Work 

Symbolic analysis is used in many recent systems for testing and verification:

σ := a random initial input state
Perform concrete execution, starting with input state σ, and obtain the trace T
numTracesConsidered := 0; divergences_faithful := 0; divergences_unfaithful := 0
Worklist := {(σ, T)}; AlreadyConsideredTraces := ∅
while Worklist ≠ ∅ and numTracesConsidered < threshold do
    Select and remove a pair (σ, T) from Worklist
    Perform two symbolic evaluations of T using the faithful and unfaithful symbolic primitives,
        respectively, generating branch predicates for each branch instruction in T
    Let B_1, B_2, ..., B_k be the branch instructions, in order, in T
    for i := k downto 1 do
        For each of the two symbolic evaluations, conjoin all the branch predicates in T prior to B_i
            with the negation of the branch predicate for B_i in T, creating path formulas
            φ_faithful and φ_unfaithful, respectively
        T_B+ := the prefix of T up to and including B_i, plus the intended successor of B_i
        if T_B+ ∈ AlreadyConsideredTraces then
            Break /* Exit the for loop; all prefixes of T_B+ are in AlreadyConsideredTraces, too */
        else
            Insert T_B+ into AlreadyConsideredTraces
        end if
        if φ_faithful is unsatisfiable then
            Continue /* Go to the next iteration of the for loop */
        end if
        σ'_faithful := a satisfying assignment for φ_faithful
        Perform concrete execution, starting with input state σ'_faithful, and obtain the trace T'
        numTracesConsidered := numTracesConsidered + 1
        if T' does not match T_B+ then
            Increment divergences_faithful by 1
        end if
        if φ_unfaithful is unsatisfiable then
            Increment divergences_unfaithful by 1
        else
            σ'_unfaithful := a satisfying assignment for φ_unfaithful
            Perform concrete execution, starting with input state σ'_unfaithful, and obtain the trace T''
            if T'' does not match T_B+ then
                Increment divergences_unfaithful by 1
            end if
        end if
        Insert (σ'_faithful, T') into Worklist
    end for
end while


Figure 4.17 Directed-test-generation algorithm used for comparing the divergence rates of the
faithful and unfaithful symbolic-evaluation primitives.

                                                    Faithful                        Approximate
Name (STL)      #Tests  Trace (#Instrs)  #Branches  CE    SE     SMT    |φ|  Div.   CE+SE  SMT    |φ|  Div.  Dist.  Slowdown (T_F/T_A)
copy            12      1462             19         0.3   3.44   0.017  6    0%     3.58   0.013  1    50%   93%    1.05
equal           202     1604             64         0.33  5.56   0.48   54   0%     5.75   0.46   24   60%   73%    1.11
find            344     1240             174        0.15  5.34   0.2    144  0%     5.31   0.17   85   50%   82%    1.07
partition       19      1293             43         0.24  5.26   0.79   43   0%     5.43   0.26   1    73%   87%    1.16
random_shuffle  94      2448             71         0.48  7.56   0.028  37   0%     7.88   0.014  1    48%   99%    1.03
search          274     1422             107        0.33  6.3    0.17   59   0%     6.37   0.13   31   55%   89%    1.07
transform       200     3749             95         0.82  18.56  0.05   85   0%     19.36  0.012  1    64%   99%    1.00

Table 4.1 Experimental results. Key: CE = time for concrete execution; SE = time for symbolic
execution; SMT = solver time; |φ| = avg. number of constraints found; Div. = divergence rate;
CE+SE = time for concrete + symbolic execution (when run in lock-step); Dist. = avg. distance
before a diverging test diverges. T_F/T_A denotes the ratio of the times (CE+SE+SMT) for the
faithful version and the approximate version. (All times are in seconds.)

• Hybrid concrete/symbolic tools for directed test generation [54, 94, 95, 167] use a combination
of concrete and symbolic evaluation to generate inputs that increase coverage. They use
concrete evaluation to identify an executable path π. They then use symbolic evaluation to obtain
a path formula for π, change the formula to be one for a path π' that follows the same
sequence of branches as π, except that at the final branch node π' branches in the direction
opposite to the one taken by π, and call an SMT solver to determine whether there is an input that
drives the program down π'.

• WLP can be used to create new predicates that split part of a program’s abstract state space
[46, 49].

• Symbolic composition is useful when a tool has access to a formula that summarizes a called
procedure’s behavior [186]; re-exploration of the procedure is avoided by symbolically composing
a path formula with the procedure-summary formula.

However, compared with the way such symbolic-analysis primitives are implemented in existing
program-analysis tools, our work has one definite advantage: it creates the key concrete-execution
and symbolic-analysis components in a way that ensures by construction that they are mutually
consistent.

We use a declarative approach: one provides a specification of the subject language’s standard
semantics; then, as described in §4.3 and §4.4, mutually-consistent implementations of symbolic
evaluation, WLP, and symbolic composition are obtained from the subject language’s standard
semantics by (i) reinterpreting meta-language constructs in terms of logic, and (ii) reinterpreting a
logic’s meaning functions. The advantage of this approach is that one obtains implementations of
(a) concrete execution, (b) symbolic evaluation, (c) WLP, and (d) symbolic composition from a
single specification, which removes the possibility of different analysis components having different
“views” of the semantics.

In most tools, it appears that the concrete-execution and symbolic-analysis primitives
are not implemented in a way that guarantees such a consistency property. For instance, in
the source code for B2 [106] (the next-generation BLAST), one finds symbolic evaluation (post)
and WLP implemented with different pieces of code, and hence mutual consistency is not
guaranteed. WLP is implemented via substitution, with special-case code for handling pointers. Any
modification of the B2 intermediate representation would require changing both post and WLP,
and possibly rethinking the substitution method.
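
For a language without pointers, the substitution-based view of WLP mentioned above reduces to the textbook rule WLP(x := e, φ) = φ[e/x]. The Python sketch below illustrates only that rule on a hypothetical tuple-based AST; it is not the code of B2 or of our generated primitives, and it deliberately omits the special handling that pointers require.

```python
# Expressions and formulas are nested tuples (a hypothetical mini-AST):
#   ("var", name) | ("const", n) | (op, left, right), with op in {"+", "<", "=="}.

def subst(term, name, replacement):
    """Return `term` with every occurrence of variable `name` replaced."""
    if term[0] == "var":
        return replacement if term[1] == name else term
    if term[0] == "const":
        return term
    op, left, right = term
    return (op, subst(left, name, replacement), subst(right, name, replacement))

def wlp_assign(name, expr, post):
    """WLP of the assignment `name := expr` with respect to `post`: post[expr/name]."""
    return subst(post, name, expr)

def evaluate(term, state):
    """Concrete evaluation, used here only to sanity-check the substitution."""
    if term[0] == "var":
        return state[term[1]]
    if term[0] == "const":
        return term[1]
    op, left, right = term
    a, b = evaluate(left, state), evaluate(right, state)
    return {"+": a + b, "<": a < b, "==": a == b}[op]

# x := x + 1 with postcondition x < 10 yields the precondition x + 1 < 10.
pre = wlp_assign("x",
                 ("+", ("var", "x"), ("const", 1)),
                 ("<", ("var", "x"), ("const", 10)))
```

Because the substitution and the concrete evaluator are written separately here, nothing forces them to agree, which is exactly the kind of hand-maintained duplication that a single-specification approach is meant to avoid.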

Recently, directed-test-generation tools have been created for x86 executables, e.g., SAGE
[95] and BITSCOPE [54]:

• BITSCOPE is a framework that takes an x86 executable and provides information about
execution paths that can be used for additional, more specific analyses, such as finding out what
inputs cause erroneous behavior. To perform symbolic evaluation, it first translates each
x86 instruction into an intermediate representation that is designed to model the semantics
of the original x86 instruction, including all implicit side effects (such as flags that are set),
register addressing modes, and other issues. Symbolic evaluation is performed on the IR
with a symbolic transformer for each IR statement.

• SAGE is a white-box fuzz-testing tool for x86 Windows applications [95]. The system uses
offline, trace-based constraint generation: concrete execution and symbolic evaluation are
performed over a separately recorded, replayable execution trace in which the outcome of
each nondeterministic event encountered during the recorded run has been captured. To
generate path constraints, SAGE maintains a concrete state and a symbolic state: a pair of
stores that associate each memory location and register with a byte-sized value and a symbolic
tag, which is an expression that represents either an input value or a function of some input
values. A symbolic tag is propagated on the trace during the process of symbolic evaluation
by using a symbolic transformer written specifically for each instruction. The concrete
store is sometimes used to concretize symbolic values that are overly complex. In SAGE,
symbolic pointer dereferences are intentionally ignored to reduce complexity. SAGE could
be improved to increase coverage by using more precise path constraints created from the
symbolic-evaluation primitive produced by our technique. §4.7 shows that the faithful
constraints created by our technique dramatically reduce the number of divergences with only a
modest (7%) increase in running time.

BITSCOPE uses the approach of translating each instruction to a common intermediate
representation (CIR) (see §4.1), which provides a level of assurance that the concrete-execution and
symbolic-evaluation components are mutually consistent. SAGE uses independently created
components for capturing execution traces and for path-constraint generation. It also uses approximate
techniques during the symbolic-evaluation part of constraint generation; hence, the treatment of
program semantics in SAGE is definitely inconsistent, which causes divergences. (WLP and
symbolic composition do not play a role in either SAGE or BITSCOPE.)

Relationship to Partial Evaluation, Binding-Time Analysis, and 2-Level Semantics. In general,
the semantic definition of an imperative programming language is a meaning function $\mathcal{I}$ with
type $\mathcal{I} : \mathit{Stmt} \times \mathit{State} \to \mathit{State}$. The objective of a primitive for symbolic evaluation can be stated
as follows:

    Given the semantic definition of a programming language, $\mathcal{I} : \mathit{Stmt} \times \mathit{State} \to \mathit{State}$,
    together with a specific programming-language statement (or instruction) $s \in \mathit{Stmt}$,
    create a logical formula that captures the semantics of $s$.

Given such a goal for the primitive to be created, it is not surprising that partial-evaluation
techniques come into play in the tool that generates implementations of such primitives. In essence, we
wish to partially evaluate $\mathcal{I}$ with respect to Stmt $s$ so that the residual object captures the semantics
of $s$, while at the same time the result is translated to $L$. Semantic reinterpretation permits us to do
this: let $U_s$ be the StructUpdate $\mathcal{I}[\![s]\!]U_{id}$. Then $U_s$ is the partial evaluation of $\mathcal{I}$ with respect to $s$,
translated to logic.

In our implementation, the TSL system is supplied with a TSL program for the meaning
function $\mathcal{I}$ (i.e., interpInstr). Although TSL is not a partial-evaluation system per se, for reasons
discussed in §3.2.1, the TSL compiler performs binding-time analysis [108], and annotates the code
for interpInstr to create an intermediate representation in a two-level language [149]. In our case,
Level 1 corresponds to parameter I of interpInstr, and Level 2 corresponds to parameter state.
To generate implementations of symbolic-analysis primitives via semantic reinterpretation, we use
two different reinterpretations for the two levels:

• Concrete semantics (C) for Level 1.

• Something close to the Herbrand interpretation (H) for Level 2: operators of L are used as
syntactic constructors, but algebraic simplifications are performed whenever possible.

Let interpInstr-CH denote interpInstr-2level reinterpreted in this fashion. When
interpInstr-CH is executed, it creates a residual expression as output. Because concrete
semantics is used for Level 1, all parts of interpInstr that are not relevant to the form of I are eliminated.
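
The two-level reinterpretation can be illustrated with a minimal Python sketch. It is not TSL code: the instruction format, the `interp_instr` specification, and the `Concrete` and `Herbrand` operator tables are hypothetical names chosen for illustration. The instruction is always decoded concretely (Level 1), while the state-level operators (Level 2) are supplied by the chosen interpretation.

```python
def interp_instr(instr, state, ops):
    """One (hypothetical) specification; `ops` gives the meaning of the state-level operators."""
    kind, dst, src, imm = instr                      # Level 1: decoded concretely
    if kind == "add":                                # dst := src + imm
        value = ops.add(ops.read(state, src), ops.const(imm))
    elif kind == "mov":                              # dst := imm
        value = ops.const(imm)
    else:
        raise ValueError(kind)
    return ops.write(state, dst, value)

class Concrete:
    """Standard interpretation: states map register names to 32-bit integers."""
    const = staticmethod(lambda n: n & 0xFFFFFFFF)
    add = staticmethod(lambda a, b: (a + b) & 0xFFFFFFFF)
    read = staticmethod(lambda state, reg: state[reg])
    write = staticmethod(lambda state, reg, v: {**state, reg: v})

class Herbrand:
    """Near-Herbrand interpretation: operators build terms, simplifying when possible."""
    const = staticmethod(lambda n: ("const", n & 0xFFFFFFFF))
    read = staticmethod(lambda state, reg: state.get(reg, ("reg", reg)))
    write = staticmethod(lambda state, reg, v: {**state, reg: v})

    @staticmethod
    def add(a, b):
        if a[0] == "const" and b[0] == "const":
            return ("const", (a[1] + b[1]) & 0xFFFFFFFF)   # constant folding
        if b == ("const", 0):
            return a                                       # x + 0 = x
        return ("add", a, b)

concrete_result = interp_instr(("add", "eax", "ebx", 4), {"eax": 0, "ebx": 6}, Concrete)
residual = interp_instr(("add", "eax", "ebx", 4), {}, Herbrand)
```

Under `Herbrand`, the empty dictionary plays the role of an identity update, and `residual` contains a term for `eax` only; the `mov` case leaves no trace in the output because the instruction kind was resolved concretely.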

Overall, the TSL compiler and the two interpretations create something that is very similar
to a generating extension [108] interpInstr-gen for interpInstr. If p is a two-input program, a
generating extension p-gen is any program with the property that for every input pair a and b,

    $[\![p\text{-gen}]\!](a) = p_a$, where $[\![p_a]\!](b) = [\![p]\!](a, b)$.

Thus, $\mathcal{I}$-gen is a program such that for every statement $s$ and State $\sigma$,

    $[\![\mathcal{I}\text{-gen}]\!](s) = \mathcal{I}_s$, where $[\![\mathcal{I}_s]\!](\sigma) = [\![\mathcal{I}]\!](s, \sigma)$.

Generating extension interpInstr-gen would be a program with the following property:

    $[\![\text{interpInstr-gen}]\!](I) = \text{interpInstr}_I$, where
    $[\![\text{interpInstr}_I]\!](S) = [\![\text{interpInstr}]\!](I, S)$.

interpInstr-CH has similar properties:

    $[\![\text{interpInstr-CH}]\!](I, U_{id}) = U_I$, where
    $\mathcal{U}[\![U_I]\!](S) = [\![\text{interpInstr}]\!](I, S)$.

Consequently, interpInstr-gen and interpInstr-CH are not the same, although the difference
between them is quite small: interpInstr-CH still requires two inputs to be supplied (but we could use the
trivial value $U_{id}$ for the second input).
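
The defining property of a generating extension can be checked on a small example. The Python sketch below uses a toy two-input program `p` and a hand-written `p_gen`; both are hypothetical and unrelated to interpInstr, and Python's built-in `eval` is used only to turn the residual program text into a callable.

```python
def p(a, b):
    """A two-input program (hypothetical): raise b to the power a by repeated multiplication."""
    result = 1
    for _ in range(a):
        result *= b
    return result

def p_gen(a):
    """Hand-written generating extension for p: returns a residual program specialized to a."""
    body = " * ".join(["b"] * a) if a > 0 else "1"
    source = f"lambda b: {body}"
    return source, eval(source)   # residual program as text and as a callable

source, p_3 = p_gen(3)
```

Here `source` is the residual program `lambda b: b * b * b`: the loop over the first input has been executed away, and only the operations on the second input are copied into the output, which mirrors the "copying the residual expression" behavior discussed for generating extensions.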

When  partial-evaluation  machinery  is  included  in  the  discussion,  the  explanation  is  complicated 
by  the  number  of  language  levels  involved.  Consequently,  in  this  chapter  we  chose  to  base  the 
discussion  on  the  simpler  principle  of  semantic  reinterpretation,  which  has  benefits  and  drawbacks: 

• The benefit is that the explanation is simpler, and could also be useful for direct hand
implementation when a meta-system such as TSL is not available.

•  The  drawback  is  that  in  some  of  the  sections  it  may  appear  that  many  steps  perform  rather 
trivial  transliteration  of  expressions  from  programming  language  PL,  into  expressions  of  the 
corresponding  logic  L[PL,].  In  part,  this  is  an  artifact  of  trying  to  present  the  method  in  an 
easy-to-digest  manner;  in  part,  it  mimics  the  behavior  of  a  generating  extension:  copying 
(or  transliterating)  the  appropriate  residual  expression  is  one  of  the  principles  of  “writing  a 
generating  extension  by  hand”  [51,  123]. 

4.9  Conclusion 

This chapter presents a way to automatically obtain mutually-consistent, correct-by-construction
implementations of symbolic primitives, in particular, quantifier-free, first-order-logic
formulas for (a) symbolic evaluation of a single command, (b) WLP with respect to a single
command, and (c) symbolic composition for a class of formulas that express state
transformations. The approach presented in the chapter involves generating implementations of each of the
primitives from a single specification of the subject language’s concrete semantics. The generated
implementations are guaranteed to be mutually consistent (modulo bugs in the
program-generation implementation), and also to be consistent with an instruction-set
emulator (for concrete execution) that is generated from the same specification of the subject language’s
concrete semantics.

In this work, the method used to generate such implementations is semantic reinterpretation,
a technique originally introduced by Mycroft and Jones [110, 144] as a method for formulating
abstract interpretations. We are not doing abstract interpretation per se (i.e., to
overapproximate the concrete semantics [73]), but we take two-fold advantage of their methodology
by using two separate semantic reinterpretations: (i) reinterpretation of a programming language’s
meaning function(s), and (ii) reinterpretation of a logic’s meaning function(s). The two kinds of
reinterpretations define the key primitives X, T, and U from which the desired implementations of
symbolic evaluation, WLP, and symbolic composition are obtained.

As far as we are aware, the application of semantic reinterpretation to a logic is a new idea. A
related innovation on which our results rest was to define a particular form of state-transformation
formula (structure-update expressions) as a first-class notion in the logic. By this device, such
formulas could (i) serve as a replacement domain in the reinterpretations of both the programming
language’s meaning functions and the logic’s meaning functions, and (ii) be reinterpreted
themselves.

We applied our technique to both the x86 and PowerPC instruction sets, using the TSL system
as our implementation platform. §4.7 discusses the substantial leverage that we obtained using
TSL’s facilities for semantic reinterpretation: from 6,580 lines of TSL, 101,788 lines of C++ were
produced that implement X, X, T, X, FE, and U for x86 and PowerPC. Moreover, for each instruction
set all six primitives are guaranteed to be mutually consistent (modulo bugs in the implementation
of TSL and in the implementations of the primitives for the two kinds of reinterpretations).

As proposed by Mycroft and Jones [110, 144], in a semantic reinterpretation one refactors the
specification of a language’s concrete semantics into a suitable form by introducing appropriate
combinators that are subsequently redefined. While this style of semantic reinterpretation is
supported by the TSL system, ordinarily one never has to be concerned with refactoring a specification.
Instead, each reinterpretation is performed at the meta-level; that is, each reinterpretation involves
redefining the approximately 40 primitives of the TSL meta-language.⁷ In our TSL-based semantic
reinterpretations of specifications of the concrete semantics of x86 and PowerPC, we did not have
to refactor the specification to introduce any special combinators.
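
Footnote 7 observes that a family of numeric primitives can usually be reinterpreted once, parameterized on bit-width. A minimal Python sketch of that idea for wraparound addition follows; `make_add` and `ADD` are hypothetical names, not the names of TSL primitives.

```python
def make_add(bits):
    """Return the wraparound-addition primitive for the given bit-width."""
    mask = (1 << bits) - 1
    return lambda a, b: (a + b) & mask

# One parameterized definition covers the whole family of four primitives.
ADD = {bits: make_add(bits) for bits in (8, 16, 32, 64)}
```

With this arrangement, a reinterpretation only needs to supply one function per family, and the four bit-widths are obtained by instantiation.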

Finally, we conducted an experiment that used the generated primitives on x86 code, compiled
under Visual Studio 2005 from C++ STL source code, to gain insight on the question “What is
the cost of using exact symbolic-evaluation primitives instead of unfaithful ones in a system for
directed test generation?” The experiment showed that using exact symbolic-analysis primitives,
as opposed to ones that approximate the real semantics, is slower by a factor of 1.07, but is
dramatically more accurate.


⁷ Each of the numeric primitives comes in four bit-widths: 8-bit, 16-bit, 32-bit, and 64-bit. All four must be
reinterpreted; however, generally the reinterpretation of a given family of four such numeric primitives can be parameterized
on bit-width, so we only count each family as a single primitive.

Chapter 5
Case Studies

This chapter discusses two applications that use the TSL-generated analysis components. Both
applications use logic-based search procedures to establish properties of machine-code programs.
Compared to work by others on logic-based search procedures for machine code, what
distinguishes the work described in this chapter is that both applications are goal-directed. That is, they
both have a target property or program point of interest, and this target is used to focus the search.
More discussion of related work is found in §5.1.5 and §5.2.9.

§5.1  presents  the  algorithms  used  in  MCVETO  (Machine-Code  VErification  TOol),  a  tool 
to  check  whether  a  stripped  machine-code  program  satisfies  a  safety  property.  The  verification 
problem  that  MCVETO  addresses  is  challenging  because  it  cannot  assume  that  it  has  access  to 
(i)  certain  structures  commonly  relied  on  by  source-code  verification  tools,  such  as  control-flow 
graphs  and  call-graphs,  and  (ii)  meta-data,  such  as  information  about  variables,  types,  and  aliasing. 
It  cannot  even  rely  on  out-of-scope  local  variables  and  return  addresses  being  protected  from  the 
program’s  actions.  What  distinguishes  MCVETO  from  other  work  on  software  model  checking 
is  that  it  shows  how  verification  of  machine  code  can  be  performed,  while  avoiding  conventional 
techniques  that  would  be  unsound  if  applied  at  the  machine-code  level. 

Botnets are a major threat to the security of computer systems and the Internet. An increasing
number of individual Internet sites have been compromised by attacks from across the world to
become part of various kinds of malicious botnets. §5.2 presents a tool, called BCE, for automatically
extracting botnet-command information from bot executables. BCE helps analyze the behavior
of bots by providing proper input commands that trigger malicious behaviors.




Both applications make use of TSL-generated analysis components, including concrete
execution as well as the symbolic-analysis primitives presented in Chapter 4. MCVETO also uses several
TSL-generated static-analysis components, including ARA (§3.3.2) and ASI (§3.3.4).

5.1  MCVETO 

As discussed in Chapter 2, machine-code analysis presents many new challenges. For instance,
at the machine-code level, memory is one large byte-addressable array, and an analyzer must
handle computed (and possibly non-aligned) addresses. It is crucial to track array accesses and
updates accurately; however, the task is complicated by the fact that arithmetic and dereferencing
operations are both pervasive and inextricably intermingled. For instance, if local variable x is at
offset -12 from the activation record’s frame pointer (register ebp), an access on x would be turned
into an operand [ebp-12]. Evaluating the operand first involves pointer arithmetic (“ebp-12”) and
then dereferencing the computed address (“[·]”). On the other hand, machine-code analysis also
offers new opportunities, in particular, the opportunity to track low-level, platform-specific details,
such as memory-layout effects. Programmers are typically unaware of such details; however, they
are often the source of exploitable security vulnerabilities.
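
The two steps involved in evaluating an operand such as [ebp-12] can be sketched in a few lines of Python. The flat `bytearray` memory, the register dictionary, and the helper `eval_operand` are assumptions made for this illustration; the only x86-specific fact relied on is that 32-bit values are stored in little-endian order.

```python
import struct

def eval_operand(memory, regs, base, offset):
    """Evaluate a 32-bit memory operand [base+offset] on a flat byte-addressable memory."""
    address = (regs[base] + offset) & 0xFFFFFFFF          # pointer arithmetic: ebp-12
    (value,) = struct.unpack_from("<I", memory, address)  # dereference: little-endian load
    return value

memory = bytearray(64)    # toy memory; a real analysis models the whole address space
regs = {"ebp": 32}
struct.pack_into("<I", memory, regs["ebp"] - 12, 0xDEADBEEF)   # store local variable x
x = eval_operand(memory, regs, "ebp", -12)
```

Nothing in this model distinguishes x from its neighbors: changing the offset by one byte yields a different, overlapping 32-bit value, which is why an analyzer cannot treat memory accesses as accesses to named variables.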

The algorithms used in software model checkers that work on source code [47, 49, 102] would
be unsound if applied to machine code. For instance, before starting the verification process
proper, SLAM [47] and BLAST [102] perform flow-insensitive (and possibly field-sensitive)
points-to analysis. However, such analyses often make unsound assumptions, such as assuming that the
result of an arithmetic operation on a pointer always remains inside the pointer’s original target.
Such an approach assumes, without checking, that the program is ANSI C compliant, and hence
causes the model checker to ignore behaviors that are allowed by some compilers (e.g., arithmetic
is performed on pointers that are subsequently used for indirect function calls; pointers move off
the ends of structs or arrays, and are subsequently dereferenced). A program can use such features
for good reasons (e.g., as a way for a C program to simulate subclassing [172]), but they can also
be a source of bugs and security vulnerabilities.

In this work, we developed a model checker for machine code, called MCVETO (Machine-Code
VErification TOol).¹ MCVETO uses directed proof generation (DPG) [98] to find either an
input that causes a (bad) target state to be reached, or a proof that the bad state cannot be reached.
(The third possibility is that MCVETO fails to terminate.)

What  distinguishes  the  work  on  MCVETO  is  that  it  addresses  a  large  number  of  issues  that  have 
been  ignored  in  previous  work  on  software  model  checking,  and  would  cause  previous  techniques 
to  be  unsound  if  applied  to  machine  code.  The  contributions  of  our  work  can  be  summarized  as 
follows: 

1. We show how to verify safety properties of machine code while avoiding a host of
assumptions that are unsound in general, and that would be inappropriate in the machine-code
context, such as reliance on symbol-table, debugging, or type information, and preprocessing
steps for (a) building a precomputed, fixed, interprocedural control-flow graph (ICFG), or
(b) performing points-to/alias analysis.

2. MCVETO builds its (sound) abstraction of the program’s state space on-the-fly, performing
disassembly one instruction at a time during state-space exploration, without static
knowledge of the split between code vs. data. (It does not have to be prepared to disassemble
collections of nested branches, loops, procedures, or the whole program all at once, which is
what can confuse conventional disassemblers [128].)

The initial abstraction has only two abstract states, defined by the predicates “PC = target”
and “PC ≠ target” (where “PC” denotes the program counter). The abstraction is gradually
refined as more of the program is exercised (§5.1.2). MCVETO can analyze programs with
instruction aliasing² because it builds its abstraction of the program’s state space entirely
on-the-fly. Moreover, MCVETO is capable of verifying (or detecting flaws in) self-modifying
code (SMC). With SMC there is no fixed association between an address and the instruction
at that address, but this is handled automatically by MCVETO’s mechanisms for abstraction
refinement. To the best of our knowledge, MCVETO is the first model checker to handle
SMC.

¹ The work on MCVETO was carried out in collaboration primarily with A. Thakur, A. Lal, and T. Reps, along with A. Burton,
D. Driscoll, M. Elder, and T. Andersen. My contribution to the work consisted of the TSL-generated analysis
components for concrete execution and symbolic execution, discussed in Chapter 4, along with the development of the
techniques described in §5.1.2.1 and §5.1.2.2.

² Programs written in instruction sets with varying-length instructions, such as x86, can have “hidden” instructions
starting at positions that are out of registration with the instruction boundaries of a given reading of an instruction
stream [128].

3. MCVETO introduces trace generalization, a new technique for eliminating families of
infeasible traces. Compared to prior techniques that also have this ability [50, 101], our technique
involves no calls on an SMT solver, and avoids the potentially expensive step of automaton
complementation.

4. MCVETO introduces a new approach to performing DPG (Directed Proof Generation) on
multi-procedure programs. Godefroid et al. [96] presented a declarative framework that
codifies the mechanisms used for DPG in SYNERGY [98], DASH [49], and SMASH [96] (which
are all instances of the framework). In their framework, interprocedural DPG is performed
by invoking intraprocedural DPG as a subroutine. In contrast, MCVETO’s algorithm lies
outside of that framework: the interprocedural component of MCVETO uses (and refines) an
infinite graph, which is finitely represented and queried by symbolic operations.

5. We developed a language-independent algorithm to identify the aliasing condition relevant
to a property in a given state (§5.1.2.1). Unlike previous techniques [49], it applies when
static names for variables/objects are unavailable.

6.  We  developed  several  techniques  to  enhance  the  methods  used  during  DPG  to  elaborate  the 
abstraction  in  use.  Although  these  techniques  are  speculative,  soundness  is  retained  at  all 
times. 

Items 1 and 2 address execution details that are typically ignored (unsoundly) by source-code
analyzers. Item 2 is specific to machine-code analysis. Items 3, 4, 5, and 6 are applicable to both
source-code and machine-code analysis. MCVETO is not restricted to an impoverished language. In
particular, it handles pointers and bit-vector arithmetic.

We implemented MCVETO in a language-independent way by using the TSL system to
implement the analysis components needed by MCVETO, i.e., (a) an emulator for running tests,
(b) a primitive for performing symbolic execution, and (c) a primitive for the pre-image operator
(Pre). In addition, we developed language-independent approaches to the issues discussed above
(e.g., item 5). As discussed in Chapter 3, the TSL system acts as a “YACC-like” tool for creating
versions of MCVETO for different instruction sets: given an instruction-set description, a version
of MCVETO is generated automatically. We created two such instantiations of MCVETO from
descriptions of the Intel x86 and PowerPC instruction sets.

The  remainder  of  this  section  is  organized  as  follows:  §5.1.1  contains  a  brief  review  of  DPG. 
§5.1.2  explains  the  methods  used  to  achieve  the  contributions  of  MCVETO.  §5.1.3  describes  how 
different  instances  of  MCVETO  are  generated  automatically  by  using  the  TSL  system.  §5.1.4 
presents  experimental  results.  §5.1.5  discusses  related  work.  §5.1.6  concludes. 

5.1.1  Background  on  Directed  Proof  Generation  (DPG) 

Given  a  program  P  and  a  particular  control  location  target  in  P,  DPG  returns  either  an  input  for 
which  execution  leads  to  target  or  a  proof  that  target  is  unreachable  (or  DPG  does  not  terminate). 
Two  approximations  of  P’s  state  space  are  maintained: 

•  A set T of concrete traces, obtained by running P with specific inputs. T underapproximates
P’s state space.

•  A graph G, called the abstract graph, obtained from P via abstraction (and abstraction
refinement). G overapproximates P’s state space.

Nodes  in  G  are  labeled  with  formulas;  edges  are  labeled  with  program  statements  or  program 
conditions.  One  node  is  the  start  node  (where  execution  begins);  another  node  is  the  target  node 
(the goal to reach). Information to relate the under- and overapproximations is also maintained: a
concrete state σ in a trace in T is called a witness for a node n in G if σ satisfies the formula that
labels n.

If G has no path from start to target, then DPG has proved that target is unreachable, and G
serves as the proof. Otherwise, DPG locates a frontier: a triple (n, I, m), where (n, m) is an edge
on a path from start to target such that n has a witness w but m does not, and I is the instruction
on (n, m). DPG either performs concrete execution (attempting to reach target) or refines G by
splitting nodes and removing certain edges (which may prove that target is unreachable). Which
action to perform is determined using the basic step from directed test generation [94], which uses
symbolic execution to try to find an input that allows execution to cross frontier (n, I, m). Symbolic
execution is performed over symbolic states, which have two components: a path constraint,
which represents a constraint on the input state, and a symbolic map, which represents the current
state in terms of input-state quantities. DPG performs symbolic execution along the path taken
during the concrete execution that produced witness w for n; it then symbolically executes I, and
conjoins to the path constraint the formula obtained by evaluating m’s predicate ψ with respect to
the symbolic map. It calls an SMT solver to determine if the path constraint obtained in this way
is satisfiable. If so, the result is a satisfying assignment that is used to add a new execution trace to
T. If not, DPG refines G by splitting node n into n′ and n″, as shown in Fig. 5.1.

Figure 5.1 The general refinement step across frontier (n, I, m). The presence of a witness is
indicated by a mark inside of a node.

Refinement changes G to represent some non-connectivity information: in particular, n′ is not
connected to m in the refined graph (see Fig. 5.1). Let ψ be the formula that labels m, c be the
concrete witness of n, and S_n be the symbolic state obtained from the symbolic execution up to
n. DPG chooses a formula ρ, called the refinement predicate, and splits node n into n′ and n″
to distinguish the cases when n is reached with a concrete state that satisfies ρ (n″) and when it
is reached with a state that satisfies ¬ρ (n′). The predicate ρ is chosen such that (i) no state that
satisfies ¬ρ can lead to a state that satisfies ψ after the execution of I, and (ii) the symbolic state
S_n satisfies ¬ρ. Condition (i) ensures that the edge from n′ to m can be removed. Condition (ii)
prohibits extending the current path along I (forcing the DPG search to explore different paths). It
also ensures that c is a witness for n′ and not for n″ (because c satisfies S_n), and thus the frontier
during the next iteration must be different.
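
The frontier search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MCVETO's implementation: the graph encoding (a dictionary of labeled edges), the name find_frontier, the returned status strings, and the use of breadth-first search to pick a path are all hypothetical choices made for the example.

```python
from collections import deque

def find_frontier(edges, start, target, witnessed):
    """Locate a frontier (n, I, m) in an abstract graph.

    edges:     dict mapping a node to a list of (instruction, successor) pairs
    witnessed: set of nodes that currently have a concrete witness
               (assumed to contain start)
    Returns ("unreachable", None) if no path from start to target exists,
    ("reached", None) if every node on the path found has a witness, and
    ("frontier", (n, I, m)) otherwise.
    """
    # Breadth-first search; parent[m] records the edge used to reach m.
    parent = {start: None}
    queue = deque([start])
    while queue:
        n = queue.popleft()
        for instr, m in edges.get(n, []):
            if m not in parent:
                parent[m] = (n, instr)
                queue.append(m)
    if target not in parent:
        return ("unreachable", None)   # G itself serves as the proof

    # Rebuild the path from start to target as a list of edges.
    path = []
    m = target
    while parent[m] is not None:
        n, instr = parent[m]
        path.append((n, instr, m))
        m = n
    path.reverse()

    # The frontier is the first edge that leaves the witnessed region.
    for n, instr, m in path:
        if n in witnessed and m not in witnessed:
            return ("frontier", (n, instr, m))
    return ("reached", None)
```

In a full DPG loop, a "frontier" result would be followed by the symbolic-execution query described above: a satisfiable path constraint extends T, and an unsatisfiable one triggers the split of n into n′ and n″.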
5.1.2  MCVETO

In this section, we focus on explaining the language-independent algorithm that we developed
to identify the aliasing condition relevant to a property in a given state (§5.1.2.1), and the mechanisms
to discover candidate invariants from a trace, which are then incorporated into the abstract
graph (§5.1.2.2). The details of contributions 1, 2, 3, and 4 listed in the introduction to §5.1 can be
found in the full paper ([174, 175]).

5.1.2.1  A  Language-Independent  Approach  to  Aliasing  Relevant  to  a  Property 

This section describes how MCVETO identifies, in a language-independent way suitable for
use with machine code, the aliasing condition relevant to a property in a given state (contribution
5 from the introduction to §5.1). Chapter 4 showed how to generate a pre-image primitive Pre for
machine code; however, repeated application of Pre causes refinement predicates to explode. We
now present a language-independent algorithm for obtaining an aliasing condition α that is suitable
for use in machine-code analysis. From α, one immediately obtains Pre_α. There are two challenges
to defining an appropriate notion of aliasing condition for use with machine code: (i) int-valued
and address-valued quantities are indistinguishable at runtime, and (ii) arithmetic on addresses is
used extensively.

Suppose that the frontier is (n, I, m), ψ is the formula on m, and S_n is the symbolic state
obtained via symbolic execution of a concrete trace that reaches n. For source code, Beckman
et al. [49] identify aliasing condition α by looking at the relationship, in S_n, between the addresses
written to by I and the ones used in ψ. However, their algorithm for computing α is
language-dependent: it has the semantics of C implicitly encoded in its search for
“the addresses written to by I”. In contrast, as explained below, we developed an alternative,
language-independent approach, both to identifying α and computing Pre_α.

For the moment, to simplify the discussion, suppose that a concrete machine-code state is
represented using two maps M : INT → INT and R : REG → INT. Map M represents
memory, and map R represents the values of machine registers. (A more realistic definition of
memory is considered later in this section.) We use the standard theory of arrays to describe
(functional) updates and accesses on maps, e.g., update(m, k, d) denotes the map m with index
k updated with the value d, and access(m, k) is the value stored at index k in m. (We use the
notation m(r) as a shorthand for access(m, r).) We also use the standard axiom from the theory
of arrays: (update(m, k1, d))(k2) = ite(k1 = k2, d, m(k2)), where ite is an if-then-else term.
Suppose that I is “mov [eax],5” (which corresponds to *eax = 5 in source-code notation) and
that ψ is (M(R(ebp) − 8) + M(R(ebp) − 12) = 10).³ First, we symbolically execute I starting
from the identity symbolic state S_id = [M ↦ M, R ↦ R] to obtain the symbolic state
S′ = [M ↦ update(M, R(eax), 5), R ↦ R]. Next, we evaluate ψ under S′, i.e., perform the
substitution ψ[M ← S′(M), R ← S′(R)]. For instance, the term M(R(ebp) − 8), which denotes
the contents of memory at address R(ebp) − 8, evaluates to (update(M, R(eax), 5))(R(ebp) − 8).
From the axiom for arrays, this simplifies to ite(R(eax) = R(ebp) − 8, 5, M(R(ebp) − 8)). Thus,
the evaluation of ψ under S′ yields

\[
\left(\begin{array}{l}
\phantom{{}+{}} \mathit{ite}(R(\mathtt{eax}) = R(\mathtt{ebp}) - 8,\ 5,\ M(R(\mathtt{ebp}) - 8)) \\
{}+ \mathit{ite}(R(\mathtt{eax}) = R(\mathtt{ebp}) - 12,\ 5,\ M(R(\mathtt{ebp}) - 12))
\end{array}\right) = 10 \tag{5.1}
\]

This formula equals Pre(I, ψ) as discussed in [125] and Chapter 4.

The process described above illustrates a general property: for any instruction I and formula
ψ, Pre(I, ψ) = ψ[M ← S′(M), R ← S′(R)], where S′ = SE⟦I⟧S_id and SE⟦·⟧ denotes symbolic
execution [125].
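
The substitution that produces Eqn. (5.1) can be reproduced with a small term rewriter. The sketch below is a hedged illustration, not the TSL-generated primitive: the tuple encoding of terms, the helper names access, substitute, and evaluate, and the convention that unwritten memory reads as 0 are all assumptions introduced for the example. The list `writes` plays the role of S′(M) for “mov [eax],5”.

```python
def access(writes, addr):
    """Read addr from map M after the given writes (oldest first), using
    the array axiom (update(m, k1, d))(k2) = ite(k1 = k2, d, m(k2))."""
    term = ("mem", addr)
    for k, d in writes:                      # later writes wrap earlier ones
        term = ("ite", ("eq", k, addr), d, term)
    return term

def substitute(t, writes):
    """Evaluate term or formula t under S' = [M -> M after writes, R -> R]."""
    tag = t[0]
    if tag in ("const", "reg"):
        return t
    if tag == "mem":
        return access(writes, substitute(t[1], writes))
    return (tag,) + tuple(substitute(a, writes) for a in t[1:])

def evaluate(t, M, R):
    """Evaluate a term or formula in the concrete state (M, R)."""
    tag = t[0]
    if tag == "const":
        return t[1]
    if tag == "reg":
        return R[t[1]]
    if tag == "mem":
        return M.get(evaluate(t[1], M, R), 0)   # assumption: default 0
    if tag == "ite":
        branch = t[2] if evaluate(t[1], M, R) else t[3]
        return evaluate(branch, M, R)
    a, b = evaluate(t[1], M, R), evaluate(t[2], M, R)
    if tag == "add":
        return a + b
    if tag == "sub":
        return a - b
    if tag == "eq":
        return a == b
    raise ValueError(tag)

# psi is (M(R(ebp) - 8) + M(R(ebp) - 12) = 10); I is "mov [eax],5".
eax, ebp = ("reg", "eax"), ("reg", "ebp")
x_addr = ("sub", ebp, ("const", 8))
y_addr = ("sub", ebp, ("const", 12))
psi = ("eq", ("add", ("mem", x_addr), ("mem", y_addr)), ("const", 10))
writes = [(eax, ("const", 5))]
pre = substitute(psi, writes)                # the term of Eqn. (5.1)
```

One way to sanity-check the result is to compare `evaluate(pre, M, R)` against `evaluate(psi, M2, R)`, where M2 is M with the value 5 stored at address R["eax"]; the two agree for any concrete state, which is what it means for `pre` to be the pre-image of ψ with respect to I.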

The next steps are to identify α and to create a simplified formula ψ′ that weakens Pre(I, ψ).
These are carried out simultaneously during a traversal of Pre(I, ψ) that makes use of the symbolic
state S_n at node n. We illustrate this on the example discussed above for a case in which S_n(R) =
[eax ↦ R(ebp) − 8] (i.e., continuing the scenario from footnote 3, eax holds &x). Because
the ite-terms in Eqn. (5.1) were generated from array accesses, ite-conditions represent possible
constituents of aliasing conditions. We initialize α to true and traverse Eqn. (5.1). For each subterm
t of the form ite(φ, t1, t2) where φ definitely holds in symbolic state S_n, t is simplified to t1 and φ
is conjoined to α. If φ can never hold in S_n, t is simplified to t2 and ¬φ is conjoined to α. If φ can
sometimes hold and sometimes fail to hold in S_n, t and α are left unchanged.

³ In x86, ebp is the frame pointer, so if program variable x is at offset -8 and y is at offset -12, ψ corresponds to
x + y = 10.




In our example, R(eax) equals R(ebp) − 8 in symbolic state S_n; hence, applying the process
described above to Eqn. (5.1) yields

\[
\begin{array}{rcl}
\psi' &=& (5 + M(R(\mathtt{ebp}) - 12) = 10) \\
\alpha &=& (R(\mathtt{eax}) = R(\mathtt{ebp}) - 8) \land (R(\mathtt{eax}) \neq R(\mathtt{ebp}) - 12)
\end{array} \tag{5.2}
\]

The formula α ⇒ ψ′ is the desired refinement predicate Pre_α(I, ψ).

In practice, we found it beneficial to use an alternative approach, which is to perform the same
process of evaluating conditions of ite terms in Pre(I, ψ), but to use one of the concrete witness
states W_n of frontier node n in place of symbolic state S_n. The latter method is less expensive (it
uses formula-evaluation steps in place of SMT solver calls), but generates an aliasing condition
specific to W_n rather than one that covers all concrete states described by S_n.
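
The witness-based variant lends itself to a compact sketch, because every ite-condition can be decided by plain evaluation. The code below is a minimal illustration under assumptions: the tuple encoding of terms, the names evaluate and simplify, the representation of α as a list of conjuncts, and the default of 0 for unwritten memory are hypothetical, and ite-conditions are assumed not to contain nested ite-terms.

```python
def evaluate(t, M, R):
    """Evaluate an ite-free term or condition in the concrete state (M, R)."""
    tag = t[0]
    if tag == "const":
        return t[1]
    if tag == "reg":
        return R[t[1]]
    if tag == "mem":
        return M.get(evaluate(t[1], M, R), 0)   # assumption: default 0
    a, b = evaluate(t[1], M, R), evaluate(t[2], M, R)
    return {"add": a + b, "sub": a - b, "eq": a == b}[tag]

def simplify(t, M, R, alpha):
    """Resolve every ite-term of t using the witness state (M, R).

    Each decided condition (or its negation) is appended to alpha,
    which represents the aliasing condition as a list of conjuncts.
    """
    tag = t[0]
    if tag in ("const", "reg"):
        return t
    if tag == "ite":
        cond, then_t, else_t = t[1], t[2], t[3]
        if evaluate(cond, M, R):
            alpha.append(cond)
            return simplify(then_t, M, R, alpha)
        alpha.append(("not", cond))
        return simplify(else_t, M, R, alpha)
    return (tag,) + tuple(simplify(a, M, R, alpha) for a in t[1:])

# Eqn. (5.1), written out for I = "mov [eax],5".
eax, ebp = ("reg", "eax"), ("reg", "ebp")
x_addr = ("sub", ebp, ("const", 8))
y_addr = ("sub", ebp, ("const", 12))
pre = ("eq",
       ("add",
        ("ite", ("eq", eax, x_addr), ("const", 5), ("mem", x_addr)),
        ("ite", ("eq", eax, y_addr), ("const", 5), ("mem", y_addr))),
       ("const", 10))

# A witness in which eax holds &x, i.e., R(eax) = R(ebp) - 8.
alpha = []
psi_prime = simplify(pre, {}, {"ebp": 100, "eax": 92}, alpha)
```

With this witness, `psi_prime` encodes 5 + M(R(ebp) − 12) = 10 and `alpha` holds the two conjuncts of Eqn. (5.2). A different witness, for example one in which eax holds neither address, would yield a different α, which reflects the trade-off noted above: the result is specific to W_n.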

Both approaches are language-independent because they isolate where the instruction-set
semantics comes into play in Pre(I, ψ) to the computation of S′ = SE⟦I⟧S_id; all remaining steps
involve only purely logical primitives.⁴ Although our algorithm computes Pre(I, ψ) explicitly, that
step alone does not cause an explosion in formula size; explosion is due to repeated application of
Pre. In our approach, the formula obtained via Pre(I, ψ) is immediately simplified to create first
ψ′, and then α ⇒ ψ′.

Byte-Addressable Memory. We assumed above that the memory map has type INT → INT. When
memory is byte-addressable, the actual memory-map type is INT32 → INT8. This complicates
matters because accessing (updating) a 32-bit quantity in memory translates into four contiguous
8-bit accesses (updates). For instance, a 32-bit little-endian access can be expressed as follows:

\[
\begin{array}{l}
\mathit{access\_32\_8\_LE\_32}(m, a) = \\
\quad\begin{array}{ll}
\mathbf{let} & v_4 = 2^{24} * \mathit{Int8To32ZE}(m(a+3)) \\
 & v_3 = 2^{16} * \mathit{Int8To32ZE}(m(a+2)) \\
 & v_2 = 2^{8} * \mathit{Int8To32ZE}(m(a+1)) \\
 & v_1 = \mathit{Int8To32ZE}(m(a)) \\
\mathbf{in} & (v_4 \mid v_3 \mid v_2 \mid v_1)
\end{array}
\end{array} \tag{5.3}
\]


⁴ A system for DPG needs the symbolic-execution primitive SE⟦I⟧ anyway for other steps of state-space
exploration. Because an implementation of SE⟦I⟧ can be generated from a description of the semantics of an instruction set
([125] and Chapter 4), an implementation of Pre_α(I, ψ) can be generated as well.
\[
\begin{array}{l}
\left(\begin{array}{l}
\phantom{\mid{}} 2^{24} * \mathit{Int8To32ZE}(\mathit{ite}(x+3 = p+3, 0, \mathit{ite}(x+3 = p+2, 0, \mathit{ite}(x+3 = p+1, 0, \mathit{ite}(x+3 = p, 5, {*}(x+3)))))) \\
\mid 2^{16} * \mathit{Int8To32ZE}(\mathit{ite}(x+2 = p+3, 0, \mathit{ite}(x+2 = p+2, 0, \mathit{ite}(x+2 = p+1, 0, \mathit{ite}(x+2 = p, 5, {*}(x+2)))))) \\
\mid 2^{8} * \mathit{Int8To32ZE}(\mathit{ite}(x+1 = p+3, 0, \mathit{ite}(x+1 = p+2, 0, \mathit{ite}(x+1 = p+1, 0, \mathit{ite}(x+1 = p, 5, {*}(x+1)))))) \\
\mid \mathit{Int8To32ZE}(\mathit{ite}(x = p+3, 0, \mathit{ite}(x = p+2, 0, \mathit{ite}(x = p+1, 0, \mathit{ite}(x = p, 5, {*}x)))))
\end{array}\right) \\[1ex]
{}+ \left(\begin{array}{l}
\phantom{\mid{}} 2^{24} * \mathit{Int8To32ZE}(\mathit{ite}(y+3 = p+3, 0, \mathit{ite}(y+3 = p+2, 0, \mathit{ite}(y+3 = p+1, 0, \mathit{ite}(y+3 = p, 5, {*}(y+3)))))) \\
\mid 2^{16} * \mathit{Int8To32ZE}(\mathit{ite}(y+2 = p+3, 0, \mathit{ite}(y+2 = p+2, 0, \mathit{ite}(y+2 = p+1, 0, \mathit{ite}(y+2 = p, 5, {*}(y+2)))))) \\
\mid 2^{8} * \mathit{Int8To32ZE}(\mathit{ite}(y+1 = p+3, 0, \mathit{ite}(y+1 = p+2, 0, \mathit{ite}(y+1 = p+1, 0, \mathit{ite}(y+1 = p, 5, {*}(y+1)))))) \\
\mid \mathit{Int8To32ZE}(\mathit{ite}(y = p+3, 0, \mathit{ite}(y = p+2, 0, \mathit{ite}(y = p+1, 0, \mathit{ite}(y = p, 5, {*}y)))))
\end{array}\right) \\[1ex]
{}= 10
\end{array}
\]

Figure 5.2 The formula for Pre(I, ψ), where ψ is
access_32_8_LE_32(M, R(ebp) − 8) + access_32_8_LE_32(M, R(ebp) − 12) = 10, obtained by
evaluating ψ on the symbolic state S′ = [M ↦ update_32_8_LE_32(M, R(eax), 5), R ↦ R]. For
brevity, the following notational shorthands are used in the formula: p = R(eax),
x = R(ebp) − 8, y = R(ebp) − 12, *x = M(R(ebp) − 8), *y = M(R(ebp) − 12), etc.

where Int8To32ZE converts an INT8 to an INT32 by padding the high-order bits with zeros, and
“|” denotes bitwise-or.

Let update_32_8_LE_32 denote the similar operation for updating a map of type INT32 → INT8
under the little-endian storage convention. Note that when 1 ≤ |k1 −_{int32} k2| ≤ 3, we no longer
have the property

access_32_8_LE_32(update_32_8_LE_32(M, k1, d), k2) = access_32_8_LE_32(M, k2),

and hence it is invalid to simplify formulas by the rule

access_32_8_LE_32(update_32_8_LE_32(M, k1, d), k2)
    ⇝ ite(k1 = k2, d, access_32_8_LE_32(M, k2)).

However, the four single-byte accesses on m in Eqn. (5.3) (m(a), m(a+1), m(a+2), and m(a+3))
are access operations for which it is valid to apply the standard axiom of arrays (i.e.,
(m[k1 ↦ d])(k2) = ite(k1 = k2, d, m(k2))).
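
A concrete model makes the failure of the word-level rule easy to observe. The following sketch is an assumption-laden illustration rather than TSL output: it models the byte map as a Python dictionary, treats unwritten bytes as 0, and uses lowercase function names chosen for the example.

```python
MASK32 = 0xFFFFFFFF

def update_32_8_le_32(m, k, d):
    """Return a copy of byte map m with the 32-bit value d stored
    little-endian at addresses k .. k+3 (addresses wrap modulo 2**32)."""
    m2 = dict(m)
    for i in range(4):
        m2[(k + i) & MASK32] = (d >> (8 * i)) & 0xFF
    return m2

def access_32_8_le_32(m, a):
    """Eqn. (5.3): combine four zero-extended bytes into a 32-bit value.
    Unwritten bytes read as 0, an assumption made for this sketch."""
    v4 = m.get((a + 3) & MASK32, 0) << 24
    v3 = m.get((a + 2) & MASK32, 0) << 16
    v2 = m.get((a + 1) & MASK32, 0) << 8
    v1 = m.get(a & MASK32, 0)
    return v4 | v3 | v2 | v1

# Store a word at k2 = 0x1000, then overwrite part of it at k1 = 0x1002.
M = update_32_8_le_32({}, 0x1000, 0x11223344)
M2 = update_32_8_le_32(M, 0x1002, 5)

overlapped = access_32_8_le_32(M2, 0x1000)   # two old bytes, two new bytes
```

Here k1 and k2 differ by 2, and the value read back mixes bytes from both writes, so it equals neither d = 5 nor the old word at k2. That is why the word-level ite rule is unsound in this range, while the rule applied to each individual byte remains valid.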

Returning to the example discussed above, in which R(eax) equals R(ebp) − 8 in symbolic
state S_n, we perform the same steps as before. First, the symbolic execution of I = mov [eax],5
starting from the identity symbolic state S_id = [M ↦ M, R ↦ R] results in the symbolic state

S′ = [M ↦ update_32_8_LE_32(M, R(eax), 5), R ↦ R].

The formula ψ is now written as follows:

access_32_8_LE_32(M, R(ebp) − 8) + access_32_8_LE_32(M, R(ebp) − 12) = 10.

To obtain Pre(I, ψ), we evaluate ψ under S′, which yields the formula shown in Fig. 5.2.
The formula shown in Fig. 5.2 is the analog of Eqn. (5.1).

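The step that uses symbolic state S_n to identify α and create a simplified formula ψ′ that
weakens Pre(I, ψ) is now applied to the formula shown in Fig. 5.2 and produces

\[
\psi' = \left( 5 + \left(\begin{array}{l}
\phantom{\mid{}} 2^{24} * \mathit{Int8To32ZE}({*}(y+3)) \\
\mid 2^{16} * \mathit{Int8To32ZE}({*}(y+2)) \\
\mid 2^{8} * \mathit{Int8To32ZE}({*}(y+1)) \\
\mid \mathit{Int8To32ZE}({*}y)
\end{array}\right) = 10 \right).
\]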


The α that is the analog of Eqn. (5.2) is the conjunction of the disequalities collected from the
formula shown in Fig. 5.2:

α = x+3 ≠ p+3 ∧ … ∧ x+3 ≠ p ∧ … ∧ x ≠ p+3 ∧ … ∧ x ≠ p
    ∧ y+3 ≠ p+3 ∧ … ∧ y+3 ≠ p ∧ … ∧ y ≠ p+3 ∧ … ∧ y ≠ p.

As before, the formula α ⇒ ψ′ is the desired refinement predicate Pre_α(I, ψ).


5.1.2.2  Speculative  Trace  Refinement 

Motivated by the observation that DPG is able to avoid exhaustive loop unrolling if it discovers
the right loop invariant, we developed mechanisms to discover candidate invariants from a trace,⁵
which are then incorporated into the abstract graph. Although they are only candidate invariants,
they are introduced into the abstract graph in the hope that they are invariants for the full program.
The basic idea is to apply dataflow analysis to a graph G_π obtained from the trace π. The recovery of
invariants from G_π is similar in spirit to the computation of invariants from traces in Daikon [84],
but in MCVETO they are computed ex post facto by dataflow analysis on the trace. While any kind
of dataflow analysis could be used in this fashion, MCVETO currently uses two analyses:

•  Affine-relation analysis (§3.3.2 and [141]) is used to obtain linear equalities over registers
and a set of memory locations, V. V is computed by running aggregate structure identification
[156] on π to obtain a set of inferred memory variables M, then selecting V ⊆ M as
the most frequently accessed locations in π.

•  An analysis based on strided-interval arithmetic (§3.3.4 and [160]) is used to discover range
and congruence constraints on the values of individual registers and memory locations.

⁵ The trace is folded by grouping together all nodes with the same effective address, and augmenting it in a way
that overapproximates the portion of the program not explored by the trace (see [174, 175] for more details).

The candidate invariants are used to create predicates for the nodes of G_π. Because an analysis
may not account for the full effects of indirect memory references on the inferred variables, to
incorporate a discovered candidate invariant φ for node n into G_π safely, we split n on φ and
¬φ. Again we have two overapproximations: G_π, from the trace, augmented with the candidate
invariants, and the original abstract graph G. To incorporate the candidate invariants into G, we
perform G := G ∩ G_π; the ∩ operation labels a product state (q1, q2) with the conjunction of the
predicates on states q1 of G and q2 of G_π.
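
To make the notion of a range-and-congruence candidate invariant concrete, the sketch below abstracts the values that one register takes at a folded trace node into a strided interval s[lb, ub]. It is a simplification under stated assumptions: MCVETO obtains such constraints by dataflow analysis on the folded trace, whereas this example merely summarizes a list of observed values, and the names strided_interval and contains are hypothetical.

```python
from functools import reduce
from math import gcd

def strided_interval(values):
    """Smallest strided interval (stride, lb, ub) covering the values.

    The interval denotes {lb, lb + stride, lb + 2*stride, ...} up to ub.
    A stride of 0 denotes the singleton {lb}.
    """
    lb, ub = min(values), max(values)
    stride = reduce(gcd, (v - lb for v in values), 0)
    return stride, lb, ub

def contains(si, v):
    """Check the candidate invariant: lb <= v <= ub and v = lb (mod stride)."""
    stride, lb, ub = si
    if not lb <= v <= ub:
        return False
    if stride == 0:
        return v == lb
    return (v - lb) % stride == 0

# Values of a loop counter observed at one folded trace node.
si = strided_interval([8, 16, 32])
```

For the values above the result is the interval 8[8, 32], which also admits the unobserved value 24. Such a constraint is only a candidate: it overapproximates what the trace showed but need not hold for the full program, which is why the node is split on the predicate and its negation rather than strengthened outright.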

5.1.3  Implementation 

The  MCVETO  implementation  incorporates  all  of  the  techniques  described  in  §5.1.2.  The 
implementation  uses  only  language-independent  techniques;  consequently,  MCVETO  can  be  easily 
retargeted  to  different  languages.  The  main  components  of  MCVETO  are  language-independent  in 
two  different  dimensions: 

1. The MCVETO DPG driver is structured so that one only needs to provide implementations of
primitives for performing concrete and symbolic execution of a language’s constructs, plus a
handful of other primitives (e.g., Pre_α). Consequently, this component can be used for both
source-level languages and machine-code languages.

2. For machine-code languages, we used two tools that generate the required implementations
of the primitives for concrete and symbolic execution from descriptions of the syntax and
concrete operational semantics of an instruction set. The abstract syntax and concrete semantics
are specified using TSL. Translation of binary-encoded instructions to abstract syntax
trees is specified using a tool called ISAL (Instruction Set Architecture Language).⁶ The
relationship between ISAL and TSL is similar to the relationship between Flex and Bison, i.e., a
Flex-generated lexer passes tokens to a Bison-generated parser. In our case, the TSL-defined
abstract syntax serves as the formalism for communicating values (namely, instructions’
abstract syntax trees) between the two tools.

In addition, we developed language-independent solutions to each of the issues in MCVETO, such
as identifying the aliasing condition relevant to a specific property in a given state (§5.1.2.1).
Consequently, our implementation acts as a “YACC-like” tool for creating versions of MCVETO for
different languages: given a description of language L, a version of MCVETO for L is generated
automatically. We created two specific instantiations of MCVETO from descriptions of the Intel
x86 and PowerPC instruction sets. To perform symbolic queries on the conceptually-infinite
abstract graph (see [174, 175] for details), the implementation uses OpenFst [33] (for transducers)
and WALi [114] (for WPDSs).

5.1.4  Experiments 

Our experiments (see Fig. 5.3) were run on a single core of a single-processor quad-core
3.0 GHz Xeon computer running Windows XP, configured so that a user process has 4 GB of
memory. They were designed to test various aspects of a DPG algorithm and to handle various
intricacies that arise in machine code (some of which are not visible in source code). We compiled
the programs with Visual Studio 8.0, and ran MCVETO on the resulting object files (without using
symbol-table information).⁷

The examples ex5, ex6, and ex8 are from the NECLA Static Analysis Benchmarks.⁸ The
examples barber, berkeley, cars, and efm are multi-procedure versions of the larger examples on
which SYNERGY [98] was tested. (SYNERGY was tested using single-procedure versions only.⁹)

⁶ ISAL also handles other kinds of concrete syntactic issues, including (a) encoding (abstract syntax trees to
binary-encoded instructions), (b) parsing assembly (assembly code to abstract syntax trees), and (c) assembly pretty-printing
(abstract syntax trees to assembly code).

⁷ The examples are available at www.cs.wisc.edu/wpis/examples/McVeto.

⁸ www.nec-labs.com/research/system/systems_SAV-website/benchmarks.php

⁹ www.cse.iitb.ac.in/~bhargav/synergy
Program                                   MCVETO performance (x86)
Name                                      Outcome       #Instrs   time
----------------------------------------  ------------  --------  ------
blast2/blast2                             timeout       326       **
fib/fib-REACH-0                           timeout       287       **
fib/fib-REACH-1                           counterex.    287       0.07
slam1/slam1                               proof         290       61.85
smc1/smc1-REACH-0*                        proof         21        959
smc1/smc1-REACH-1*                        counterex.    21        0.016
ex5/ex                                    counterex.    270       0.18
doubleloopdep/count-COUNT-5               counterex.    252       1.09
doubleloopdep/count-COUNT-6               counterex.    252       1.08
doubleloopdep/count-COUNT-7               counterex.    252       1.21
doubleloopdep/count-COUNT-8               counterex.    252       1.51
doubleloopdep/count-COUNT-9               counterex.    252       2.82
inter.synergy/barber                      timeout       454       2.02
inter.synergy/berkeley                    counterex.    305       **
inter.synergy/cars                        proof         378       5.13
inter.synergy/efm                         timeout       403       **
share/share-CASE-0                        proof         262       93.95
stress/diamonds-SHORT                     proof         257       0.27
cert/underflow                            counterex.    323       0.52
instraliasing/instraliasing-REACH-0       proof         46        15.0
instraliasing/instraliasing-REACH-1       counterex.    46        5.86
longjmp/jmp                               AE viol.      74        0.015
overview0/overview                        proof         49        54.9
small.static_bench/ex5                    proof         251       0.13
small.static_bench/ex6                    proof         259       1.93
small.static_bench/ex8                    proof         297       4.6
verisec-gxine/simp_bad                    counterex.    1067      0.094
verisec-gxine/simp_ok                     proof         1068      **
clobber_ret_addr/clobber-CASE-4           AE viol.      43        2.13
clobber_ret_addr/clobber-CASE-8           AE viol.      35        0.625
clobber_ret_addr/clobber-CASE-9           proof         35        1.44

Figure 5.3 MCVETO experiments. The columns show whether MCVETO returned a proof,
counterexample, or an AE violation (Outcome); the number of instructions (#Instrs); the number
of concrete executions (CE); the number of symbolic executions (SE), which also equals the
number of calls to the YICES solver; the number of refinements (Ref), which also equals the
number of Pre_α computations; and the total time (in seconds). *SMC test case. **Exceeded
twenty-minute time limit.


Instraliasing illustrates the ability to handle instruction aliasing. (The instruction count for this
example was obtained via static disassembly, and hence is only approximate.) Smc1 illustrates the
ability of MCVETO to handle self-modifying code. Underflow is taken from a DHS tutorial on
security vulnerabilities. It illustrates a strncpy vulnerability.




The examples are small, but challenging. They demonstrate MCVETO’s ability to reason
automatically about low-level details of machine code using a sequence of sound abstractions. The
question of whether the cost of soundness is inherent, or whether there is some way that the
well-behavedness of (most) code could be exploited to make the analysis scale better, is left for future
research.

5.1.5  Related  Work 

Machine-Code  Analyzers  Targeted  at  Finding  Vulnerabilities.  A  substantial  amount  of  work 
exists  on  techniques  to  detect  security  vulnerabilities  by  analyzing  source  code  for  a  variety  of 
languages  [129,  180,  185].  Less  work  exists  on  vulnerability  detection  for  machine  code.  Kruegel 
et  al.  [118]  developed  a  system  for  automating  mimicry  attacks;  it  uses  symbolic  execution  of 
machine  code  to  discover  attacks  that  can  give  up  and  regain  execution  control  by  modifying  the 
contents  of  the  data,  heap,  or  stack  so  that  the  application  is  forced  to  return  control  to  injected 
attack  code  at  some  point  after  the  execution  of  a  system  call.  Cova  et  al.  [75]  used  that  platform 
to  detect  security  vulnerabilities  in  x86  executables  via  symbolic  execution. 

Prior work exists on directed test generation for machine code [55, 95]. Directed test generation
combines  concrete  execution  and  symbolic  execution  to  find  inputs  that  increase  test  coverage.  An 
SMT  solver  is  used  to  obtain  inputs  that  force  previously  unexplored  branch  directions  to  be  taken. 
In  contrast,  MCVETO  implements  directed  proof  generation  for  machine  code.  Unlike  directed- 
test-generation  tools,  MCVETO  is  goal-directed,  and  works  by  trying  to  refute  the  claim  “no  path 
exists  that  connects  program  entry  to  a  given  goal  state”. 

Machine-Code Model Checkers. SYNERGY applies to an x86 executable for a “single-procedure
C program with only [int-valued] variables” [98] (i.e., no pointers). It uses debugging information
to obtain information about variables and types, and uses Vulcan [173] to obtain a CFG. It uses
integer arithmetic, not bit-vector arithmetic, in its solver. Quoting A. Nori, “[[98] handles] the
complexities of binaries via its front-end Vulcan and not via its property-checking engine” [150].
In contrast, MCVETO addresses the challenges of checking properties of stripped executables
articulated in Chapter 2.

AIR (“Assembly Iterative Refinement”) [61] is a model checker for PowerPC. AIR decompiles
an assembly program to C, and then checks if the resulting C program satisfies the desired property
by applying COPPER [60], a predicate-abstraction-based model checker for C source code. The
authors state that the choice of COPPER is not essential, and that any other C model checker, such as
SLAM [47] or BLAST [102], would be satisfactory. However, the C programs that result from their
translation step use pointer arithmetic and pointer dereferencing, whereas many C model checkers,
including SLAM and BLAST, make unsound assumptions about pointer arithmetic.

[MC]SQUARE  [165]  is  a  model  checker  for  microcontroller  assembly  code.  It  uses  explicit- 
state  model-checking  techniques  (combined  with  a  degree  of  abstraction)  to  check  CTL  properties. 

Our group developed two prior machine-code model checkers, CodeSurfer/x86 [44] and
DDA/x86 [43]. Neither system uses either underapproximation or symbolic execution. For
overapproximation, both use numeric static analysis and a different form of abstraction refinement than
the one used in MCVETO.

Self-Modifying  Code.  The  work  on  MCVETO  addresses  a  problem  that  has  been  almost  entirely 
ignored  by  the  PL  research  community.  There  is  a  paper  on  SMC  by  Gerth  [90],  and  a  recent  paper 
by  Cai  et  al.  [59].  However,  both  of  the  papers  concern  proof  systems  for  reasoning  about  SMC. 
In  contrast,  MCVETO  can  verify  (or  detect  flaws  in)  SMC  automatically. 

As  far  as  we  know,  MCVETO  is  the  first  model  checker  to  address  verifying  (or  detecting  flaws 
in)  SMC. 

5.1.6  Conclusion 

MCVETO resolves many issues that have been unsoundly ignored in previous work on software
model checking. MCVETO addresses the challenge of establishing properties of the machine
code that actually executes, and thus provides one approach to checking the effects of compilation
and optimization on correctness. The contributions of the work described in §5.1.2 lie in the
insights that went into defining the innovations in dynamic and symbolic analysis used in MCVETO:
(i) sound disassembly and sound construction of an overapproximation (even in the presence of
instruction aliasing and self-modifying code) (see [174] for the details), (ii) a new method to
eliminate families of infeasible traces (see [174] for the details), (iii) a method to speculatively,
but soundly, elaborate the abstraction in use (§5.1.2.2), (iv) new symbolic methods to query the
(conceptually infinite) abstract graph (see [174] for the details), and (v) a language-independent
approach to Pre_α (§5.1.2.1). Not only are our techniques language-independent, the implementation
is parameterized by specifications of an instruction set’s semantics. By this means, MCVETO
has been instantiated for both x86 and PowerPC.

5.2  BCE 

As discussed in §1.5.4, an increasing number of individual Internet sites have been compromised
by attacks from across the world to become part of various kinds of malicious botnets. The
Internet  security  research  community  has  made  significant  efforts  to  identify  botnets,  to  collect 
data  on  their  activities,  and  to  develop  techniques  for  detection,  mitigation,  and  disruption. 

We  have  developed  a  tool  called  BCE  (Botnet-Command  Extractor)  for  extracting  botnet- 
command information from bot executables. BCE aims to provide useful information from analysis
of bot executables by automatically extracting proper inputs that trigger malicious behavior.
Applications  of  the  information  recovered  include  observing  and  analyzing  malicious  behaviors, 
as  well  as  identifying  and  mitigating  botnets. 

A  typical  way  to  analyze  the  behavior  of  a  bot  is  to  run  the  executable  and  observe  its  actions. 
To  carry  this  out,  however,  one  needs  proper  inputs  to  trigger  malicious  behaviors.  Some  widely- 
known  commands  are  often  used  for  this  purpose.  However,  attackers  can  easily  change  their 
commands  to  evade  such  dynamic  analysis.  Also,  it  is  a  hard  problem  to  obtain  such  inputs  by 
manually  stepping  through  the  executable.  BCE  automates  the  extraction  of  information  about 
botnet  commands  and  the  arguments  to  commands. 

The work described in this section makes the following contributions:

1. BCE automatically extracts botnet-command information from bot executables, without
source code or symbol-table/debugging information. The extracted information includes
(a) constant command strings that trigger API-level behaviors, (b) relationships, including
type relationships, between the input command string and the actual parameters of an API
call, and (c) constraints on the actual parameters of an API call. The information obtained
via BCE can be used to build up proper input commands that trigger API-level behaviors.

(a)
    ...
    else if (strcmp(cmd, ":!p") == 0) {
        // (1)
    }
    else if (strcmp(cmd, ":!p2") == 0) {
        // (2)
    }
    else if (strcmp(cmd, ":!ppp") == 0) {
        // (3)
    }

(b)
    ...
    else if (*cmd++ == ':'
          && *cmd++ == '!'
          && *cmd++ == 'p') {
        if (*cmd == 0)
            // (1)
        else if (*cmd == '2')
            // (2)
        else if (*cmd++ == 'p'
              && *cmd++ == 'p')
            // (3)
    }

(c)
    procedure foo
    push offset aPl          ; ":!p"
    lea  eax, [ebp+arg_0]
    push eax
    call strcmp
    add  esp, 0Ch
    or   eax, eax
    jnz  short loc_402210
    ....                     // (1)
    push offset aPl          ; ":!p2"
    lea  eax, [ebp+arg_0]
    push eax
    call strcmp
    add  esp, 0Ch
    or   eax, eax
    jnz  short loc_402210
    ....                     // (2)
    push offset aPl          ; ":!ppp"
    lea  eax, [ebp+arg_0]
    push eax
    call strcmp
    add  esp, 0Ch
    or   eax, eax
    jnz  short loc_402210
    ....                     // (3)

Figure 5.4 (a) A snippet of the EvilBot source code, (b) alternative source code,
(c) the assembly code of (a).

2. BCE is able to provide a specification of the API-level behaviors of a bot program without
running the bot. Along with the input-command strings extracted from a bot program, BCE
also provides a sequence of API calls controlled by each command, which can help the user
understand the API-level behavior.

3. BCE is not based on signatures. Some recent approaches to finding botnet commands
are based on pattern-matching techniques. Many bot programs use standard string-library
functions to process the input command string, as shown in Fig. 5.4(a). The assembly code
of Fig. 5.4(a) obtained using the IDA Pro disassembler is shown in Fig. 5.4(c). One can
find a pattern in the assembly code: there are two push instructions, one of which is for a
constant string that IDA Pro readily identifies, followed by a call to strcmp. However, such a
technique is ad hoc and can be easily evaded, e.g., by changing the code in Fig. 5.4(a) to use
byte-by-byte comparison instead of using standard library functions, as shown in Fig. 5.4(b).

4. BCE uses directed test generation [94], enhanced with a new search technique that uses
control-dependence information [86] to direct the search. Our experiments show that the
method provides higher coverage of the parts of the program relevant to identifying bot
commands, as well as lower overall execution time than the standard program exploration
that does not use control-dependence information.

5.  We  performed  experiments  with  four  real  bot  programs.  Our  preliminary  results  show  that 
BCE  is  able  to  effectively  extract  bot-command  information. 
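
The brittleness of signature-based extraction noted in contribution 3 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tuple representation of instructions, the label aCmd, the command string "!s", the byte-by-byte instruction sequence, and the function name find_strcmp_commands. It is not the output of any real disassembler and not part of BCE.

```python
def find_strcmp_commands(instructions):
    """Return string constants pushed shortly before a call to strcmp.

    Each instruction is a (mnemonic, operand, string_constant) tuple; this
    representation is an assumption made for the sketch.
    """
    found = []
    for i, (mnemonic, operand, _) in enumerate(instructions):
        if mnemonic == "call" and operand == "strcmp":
            # Inspect the three instructions preceding the call for a
            # push of a constant-string offset.
            for m, op, constant in instructions[max(0, i - 3):i]:
                if m == "push" and op.startswith("offset") and constant:
                    found.append(constant)
    return found


# Simplified rendering of one library-based comparison (hypothetical data).
library_version = [
    ("push", "offset aCmd", "!s"),    # aCmd is a made-up label
    ("lea", "eax, [ebp+arg_0]", None),
    ("push", "eax", None),
    ("call", "strcmp", None),
    ("or", "eax, eax", None),
    ("jnz", "short loc_402210", None),
]

# Hypothetical compilation of a byte-by-byte test for the same command:
# the command characters appear only as immediate operands.
bytewise_version = [
    ("mov", "al, [ecx]", None),
    ("inc", "ecx", None),
    ("cmp", "al, 21h", None),         # '!'
    ("jnz", "short loc_402210", None),
    ("mov", "al, [ecx]", None),
    ("inc", "ecx", None),
    ("cmp", "al, 73h", None),         # 's'
    ("jnz", "short loc_402210", None),
]

print(find_strcmp_commands(library_version))
print(find_strcmp_commands(bytewise_version))
```

The matcher recovers the command only from the library-based version; the byte-by-byte version contains no strcmp call and no string constant, so the signature finds nothing even though both fragments accept the same command.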

Organization. The remainder of the section is organized as follows: §5.2.1 discusses what kind of
information BCE extracts, and how one can make use of the information to trigger potentially
malicious behaviors from a bot. §5.2.2 presents background on directed test generation [94]. §5.2.3
presents the enhanced techniques for exploring program paths that we developed for use in BCE.
§5.2.4 describes the use of nondeterminism in BCE, which is used for writing “harness” code
to model possible client environments, possible inputs, and possible return values from library
functions or system calls. §5.2.5 discusses additional information that BCE recovers, which
combines the recovered information about constraints on inputs with type information for the target
API calls. §5.2.6 describes how a language-independent BCE implementation was created. §5.2.7
presents experimental results. §5.2.8 discusses the limitations of BCE. §5.2.9 discusses related
work. §5.2.10 concludes.

5.2.1 Botnet-Command Extractor (BCE)

In  this  section,  we  first  discuss  what  information  BCE  relies  on  to  extract  botnet  commands. 
We  then  summarize  the  kind  of  information  that  BCE  provides,  and  how  one  can  make  use  of  such 
information  to  generate  proper  input  commands. 

5.2.1.1  What  BCE  Relies  On 

1. API prototypes: BCE relies on information about function prototypes of API functions (system
calls). For example, the prototype of ShellExecute is as follows:

    HINSTANCE ShellExecute(
        HWND    hwnd,
        LPCTSTR lpOperation,
        LPCTSTR lpFile,
        LPCTSTR lpParameters,
        LPCTSTR lpDirectory,
        INT     nShowCmd
    );

    lpDirectory: [in] A pointer to a null-terminated
    string that specifies the default (working)
    directory for the action.

The function prototypes are used to construct reasonable input commands given the
command specification extracted by BCE.

2. Control-Dependence Graph: BCE makes use of the control-dependence graph for a bot binary
to optimize its state-space-exploration algorithm. We discuss the use of control dependences
in more detail in §5.2.3.

(a)
    cmd <- char* for command string
    token[] <- tokenization of cmd

    if (strcmp(token[0], "hello") == 0) {
        if (strcmp(token[1], ",") == 0) {
            if (strcmp(token[2], "world") == 0) {
                WinExec("login.exe");
                ShellExecute(..., token[3], ...);
            }
        }
    }

(b)
    "hello"     ","        "world"
    token[0]    token[1]   token[2]

(c)
    WinExec        The argument is a constant "login.exe".
    ShellExecute   The fifth argument is from the fourth token of the command,
                   and its type is LPCTSTR.

(d)
    void foo(char* cmd) {
        int n = atoi(cmd);
        if (n > 0) {
            if (n < 25) {
                ApiCall(n);
            }
        }
    }

(e)
    "17\0"
    "3\0"

(f)
    n_sym_expr = (cmd[0] - 48) x 10 + (cmd[1] - 48)

(g)
    n_sym_expr > 0 && n_sym_expr < 25

Figure 5.5 (a) A simple example program; (b) the command string constructed based on the information
obtained from BCE; (c) a sequence of API calls obtained from BCE; (d) another simple example;
(e) constant examples provided by BCE; (f) the symbolic expression obtained from BCE for the argument
n; (g) the constraint obtained from BCE.

5.2.1.2 What BCE Recovers and How to Use the Recovered Information

1. Constant command strings that control a bot. For example, there are three nested if-statements
in the code shown in Fig. 5.5(a). Two API calls are invoked when the three branch conditions
are satisfied. Suppose that cmd has been tokenized into three null-terminated strings.
Fig. 5.5(b) is the command string constructed based on the information extracted by BCE.
This information is obtained from conditional branches where a portion of the command
string is compared against some constants, such as the three strings (“hello”, “,”, and “world”) in
the example.

2. A sequence of API calls controlled by each command. Along with each command, BCE
provides a sequence of API calls that are controlled by the command. For example, the code
executed when the command string shown in Fig. 5.5(b) is issued subsequently invokes
WinExec and ShellExecute. This information can be directly used to get an idea of the
API-level behavior of a bot without actually executing it.

3. Information about the actual arguments of each API call. In addition to a sequence of API
calls, BCE provides information about the arguments to each API call, such as constant
values for an argument, symbolic expressions, and constraints on the symbolic expressions,
as shown in Fig. 5.5(e), (f), and (g), respectively.

•  Constant  arguments:  In  many  cases,  API  calls  take  constant  arguments  that  one 
can  statically  extract  from  binaries.  For  example,  the  first  argument  of  WinExec  in 
Fig.  5.5(a)  is  a  constant  string  “login.exe”.  In  addition  to  the  sequence  of  API  calls, 
information  about  argument  values  enables  one  to  get  a  better  idea  of  the  API-level 
behavior  of  a  bot  without  running  it. 

• Symbolic expressions in the input-state vocabulary: BCE also provides a symbolic
expression for each actual parameter of an API call, along with its type information,
as long as the argument is related to some part of the input command. For example,
ShellExecute in Fig. 5.5(a) takes the fourth token of the input command as its fifth
argument. BCE automatically extracts a symbolic expression that has one symbolic
term, token[3], along with its type LPCTSTR. The type information is obtained from
the prototype of the API call. The type information is used to come up with a proper
input string. Given the information that the fourth token is supposed to be a
null-terminated string that specifies a working directory name, one can build up a complete
command string as follows:

"hello , world C:\temp"

Fig. 5.5(f) shows another example of a symbolic expression that BCE provides.
Fig. 5.5(f) is the symbolic expression obtained for n in Fig. 5.5(d). In Fig. 5.5(d), the
input command string is a numeral, which is converted into a number by calling atoi;
the number is then passed into an API call as an argument. The symbolic expression
is in the input vocabulary in that the symbols (cmd[0] and cmd[1]) that appear in
it represent individual byte values of the input command string. We discuss how the
symbolic expression is generated in §5.2.2.

•  Constraints  on  symbolic  expressions:  BCE  also  provides  constraints  on  the  symbolic 
expressions  extracted  for  each  actual  parameter  of  an  API  call,  if  any.  For  example, 
BCE  extracts  the  constraint  shown  in  Fig.  5.5(g)  for  the  actual  parameter  n  to  the  API 
call  in  Fig.  5.5(d). 

This  constraint  is  obtained  from  the  two  conditional  branches  that  guard  the  API  call. 
BCE  finds  out  the  conditional  branches  on  which  the  API  call  transitively  depends.  It 
only  collects  branches  whose  predicates  constrain  the  given  symbolic  expression. 

The  obtained  constraints  also  play  an  important  role  for  building  up  proper  input 
commands.  BCE  provides  some  concrete  examples  for  n,  as  shown  in  Fig.  5.5(e):  the 
numeral  strings  “17”  and  “3”  satisfy  the  two  branch  predicates  (n  >  0  and  n  <  25). 
Therefore,  these  input  strings  cause  the  API  call  to  be  invoked,  and  thus  can  be  directly 
used  to  run  the  bot  program.  However,  there  are  cases  when  the  automatically  generated 
concrete  examples  fail  to  trigger  observable  behavior  of  a  bot.  For  example,  suppose 
the  API  in  Fig.  5.5(d)  is  some  API  that  takes  an  IP  address  and  sets  up  a  connection 
to the server (e.g., httpserver of SpyBot). Because concrete examples are randomly
selected to satisfy the constraints collected during symbolic execution, BCE is unlikely
to find a reasonable IP address unless there are conditional branches from which
it can extract proper constraints on the command. Therefore, in some cases, the user is
responsible for making use of the extracted constraints to construct reasonable inputs.
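
The relationship between the symbolic expression of Fig. 5.5(f), the guard of Fig. 5.5(g), and the concrete examples of Fig. 5.5(e) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it covers the two-byte case of the expression, and the function names and the brute-force enumeration are hypothetical stand-ins for BCE's symbolic evaluator and constraint solver.

```python
def n_sym_expr(cmd):
    """Evaluate (cmd[0] - 48) * 10 + (cmd[1] - 48) on a two-byte input."""
    return (cmd[0] - 48) * 10 + (cmd[1] - 48)


def satisfies_guard(n):
    """The constraint of Fig. 5.5(g): n > 0 and n < 25."""
    return n > 0 and n < 25


def concrete_examples():
    """Enumerate every two-digit numeral whose value passes the guard."""
    digits = range(ord("0"), ord("9") + 1)
    return [
        bytes([d0, d1]).decode("ascii")
        for d0 in digits
        for d1 in digits
        if satisfies_guard(n_sym_expr(bytes([d0, d1])))
    ]


examples = concrete_examples()
print(len(examples), "17" in examples, "25" in examples)
```

Any string in the resulting list drives execution to the API call in Fig. 5.5(d). As the text notes, such inputs satisfy the guards but are not necessarily meaningful arguments, which is why the user may still need to pick values by hand.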

§5.2.5 discusses other kinds of information about the bot’s commands that BCE provides, in
particular, information that combines the recovered symbolic information about inputs with type
information for the target API calls.

5.2.2  Background  on  Directed  Test  Generation  and  Overview  of  BCE 

This section provides background on directed test generation [94], which collects path
constraints and uses them to explore new paths systematically. In applying directed test generation in
BCE to the problem of extracting bot commands, we developed new techniques to explore
program paths, which differ from conventional directed-test-generation techniques. We discuss our
enhanced search algorithms in §5.2.3.

One example of a directed test-generation tool is SAGE [95], which is a whitebox fuzz-testing
tool, an advance on fuzz testing based on random mutations. SAGE records an actual run of a
program under test, starting with a well-formed input, then symbolically evaluates the recorded trace
and generates constraints that capture how the program uses its inputs. The generated constraints
are then systematically modified and solved with a constraint solver to produce new inputs that
cause the program to follow different control-flow paths. The process is repeated with a
coverage-maximizing heuristic designed to find defects as fast as possible. Fig. 5.6 shows a simple example
taken from [95]. There are 5 values leading to the error out of $2^{8 \times 4}$ possible values for 4 bytes.
Therefore, the probability of hitting the error with random testing is about $1/2^{32}$. In contrast,
whitebox dynamic test generation can find the error in at most $2^4 = 16$ iterations (4 valid path
constraints are collected during the exploration process).

Algorithm 2 Single BCE Iteration
Require: A concrete state S.
Require: A trace tree T.
 1: Concretely execute the program with the concrete state S.
 2: Let CT be the concrete trace obtained from the concrete execution.
 3: Symbolically execute the trace CT.
 4: Let T' be the trace tree augmented by the symbolic execution.
 5: if at least one API call is encountered in the concrete trace then
 6:    Based on the symbolic state obtained in the symbolic execution, collect information about
       the command tokens that appear in the arguments to each API call.
 7: end if
 8: repeat
 9:    Choose a new path $\pi$ in the trace tree T.
10:    Let p be the path-constraint formula obtained by conjoining the branch constraints along $\pi$.
11: until p is satisfiable
12: Let M be the model obtained by calling the constraint solver with p.
13: Create the new concrete state S' updated with the assignments from the model M.


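Alg. 2 shows the basic search step of the BCE algorithm. The outline of the algorithm is similar
to typical directed-test-generation techniques, which can be roughly summarized as repeatedly
applying three steps: concrete execution of the program, symbolic execution of the resulting trace,
and constraint solving to obtain an input that follows a new path (see footnote 10).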

BCE maintains a trace tree that is expanded during the process of symbolic execution. Each
node in a trace tree represents a different execution instance of a branch instruction in the program.
Each node can have two children, one of which represents the first branch node encountered along
the path through the true successor, the other of which is the first branch node along the path
through the false successor. The path from the root node to a leaf node represents the branch
instructions of a concrete trace. Each edge holds a branch constraint obtained from symbolic
execution. Each time a branch is symbolically executed (to follow the direction taken by a previous
concrete execution), the trace tree is extended appropriately.

10 The first step (concrete execution) and the second step (symbolic execution) can be done simultaneously, which
is sometimes called concolic execution [167]. In concolic execution, concrete values from the concrete execution state
are sometimes used to simplify the symbolic states created during symbolic execution.

    void top(char input[4]) {
        int cnt = 0;
        if (input[0] == 'b') cnt++;
        if (input[1] == 'a') cnt++;
        if (input[2] == 'd') cnt++;
        if (input[3] == '!') cnt++;
        if (cnt >= 4) abort();
    }

Figure 5.6 An example for whitebox fuzz testing
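
To make the run/record/negate/solve loop concrete, the following Python sketch applies it to the program of Fig. 5.6. It is a toy under stated assumptions, not SAGE or BCE: the "solver" only handles byte-equality constraints, only the four input-dependent branches are negated, new inputs are explored breadth-first, and the names top, solve, and directed_search are chosen for this illustration.

```python
TARGET = b"bad!"


def top(data):
    """Python rendering of Fig. 5.6: returns (branch outcomes, error reached)."""
    outcomes = [data[i] == TARGET[i] for i in range(4)]
    return outcomes, sum(outcomes) >= 4


def solve(constraints, seed):
    """Toy solver for constraints of the form (index, byte, must_equal)."""
    candidate = bytearray(seed)
    for index, byte, must_equal in constraints:
        if must_equal:
            candidate[index] = byte
        elif candidate[index] == byte:
            candidate[index] = (byte + 1) % 256  # any value other than byte
    return bytes(candidate)


def directed_search(seed):
    """Run, record branch constraints, negate one, solve, and repeat."""
    worklist, seen, runs = [seed], {seed}, 0
    while worklist:
        data = worklist.pop(0)
        runs += 1
        outcomes, error = top(data)
        if error:
            return data, runs
        path = [(i, TARGET[i], outcomes[i]) for i in range(4)]
        for j in range(4):
            # Keep the constraints before branch j and negate branch j.
            flipped = path[:j] + [(j, TARGET[j], not outcomes[j])]
            new_input = solve(flipped, seed)
            if new_input not in seen:
                seen.add(new_input)
                worklist.append(new_input)
    return None, runs


found, runs = directed_search(b"good")
print(found, runs)
```

Because each of the four bytes only ever takes one of two values in this sketch, at most 16 distinct inputs can be generated, which mirrors the bound quoted for whitebox test generation; random testing would have to guess all 32 bits at once.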

5.2.3  Program  Exploration  using  Control-Dependence  Information 

This section presents the enhanced techniques for exploring program paths that we developed
for use in BCE. MineSweeper [55] and the work of Moser et al. [139] have shown the potential
for carrying out better exploration in malware. Other tools, such as SAGE, have addressed the
problem of path explosion by introducing heuristics to improve coverage [95]. SAGE uses
so-called generational search, designed to partially explore the state spaces of large applications with
the aim of finding bugs faster. Like most other directed-test-generation tools, SAGE aims to
improve test coverage. Unlike bug-finding tools or tools that aim to improve coverage, in BCE we
are interested in goal-directed techniques aimed at extracting bot commands.

The  characteristics  of  how  the  bot  code  parses  the  transmitted  commands  and  takes  actions 
depending  on  the  parsed  commands  can  be  used  to  come  up  with  better  exploration  strategies  that 
avoid  possible  explosion  and  obtain  more  complete  specifications  about  the  command  structure. 
We incorporated the following path-exploration strategies into BCE:

• Choose as a candidate for the new path the branches that have a possibility of leading to API
calls (see footnote 11).

• Prune the search performed by BCE so that each path includes a limited number of API calls
if a candidate branch for extending the path is independent of the branches involved with the
API calls already found in the path.

The  exploration  strategies  are  based  on  the  fact  that  our  goal  is  to  identify  as  many  feasible  input 
commands  as  possible  that  lead  to  API  calls  of  interest. 

To identify branches that have a possibility of encountering API calls, we use
control-dependence information. §5.2.3.1 discusses control-dependence information. In §5.2.3.2 and
§5.2.3.3, we present how control-dependence information is used in BCE.

5.2.3.1 Control Dependence

The  control  dependence  relation  is  one  of  the  fundamental  relationships  among  statements 
or  instructions  used  in  compilers  and  optimizers.  For  instance,  control-dependence  information  is 
used in compilers to determine whether it is safe to reorder or parallelize statements [86]. A control
dependence  holds  when  the  decision  made  at  a  branch  X  controls  whether  another  statement  or 
instruction  Y  is  executed. 

Control  dependence  is  defined  in  terms  of  the  post-domination  relation. 

Definition 5.1 Node Z post-dominates node X iff $Z \neq X$ and all paths from X to the end of the
procedure include Z. (Note that by this definition a node does not post-dominate itself.)

Definition 5.2 Node Y is directly control dependent on node X iff

1. there exists a path $\pi: X \rightarrow^{*} Y$ such that Y post-dominates every node in $\pi$ different from
X, and

2. X is not post-dominated by Y.

We  use  C  to  denote  the  direct-control-dependence  relation. 


11 BCE is parameterized to take a list of API entry points of interest.




Control  dependences  can  be  broken  down  more  finely  into  dependences  on  the  true  branch  or 
false  branch  of  a  branch-node  X,  as  follows: 

Definition 5.3 Node Y is directly control dependent on edge $X \rightarrow W$ iff

1. there exists a path $\pi: W \rightarrow^{*} Y$ such that Y post-dominates every node in $\pi$ different from
X, and

2. X is not post-dominated by Y.

We say that the relation $C_t(X, Y)$ holds when X is a branch node and Y is directly control
dependent on X’s true branch. $C_f$ is defined similarly.

Each branch node is associated with two sets of CFG nodes: one consists of the transitive
control-dependence successors for its true branch (denoted by $C_t C^{*}$); the other consists of the
transitive control-dependence successors for its false branch (denoted by $C_f C^{*}$).

$C_t C^{*}$ : True control successors
$C_f C^{*}$ : False control successors

For example, in Fig. 5.7, the statements (s1) and (s2) are transitively control dependent on
the true branch of b1; statement (s3) is transitively control dependent on the false branch of b1.
Statement (s4) is not transitively control dependent on any branch in this example. (Henceforth,
we will abbreviate “transitive control dependence” by “control dependence”.)
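
The control dependences claimed for Fig. 5.7 can be checked mechanically. The Python sketch below encodes the control-flow graph of Fig. 5.7 by hand and uses a common set-based formulation: Y is directly control dependent on edge X -> W when Y equals or post-dominates W, and Y does not post-dominate X. We assume this matches the intent of Definitions 5.1 to 5.3; the graph encoding, the explicit exit node, and the function names are choices made for this illustration, not BCE's implementation.

```python
# CFG of Fig. 5.7; branch nodes map the labels "T"/"F" to their successors.
CFG = {
    "b1": {"T": "s1", "F": "s3"},
    "s1": {"-": "b2"},
    "b2": {"T": "s2", "F": "s4"},
    "s2": {"-": "s4"},
    "s3": {"-": "s4"},
    "s4": {"-": "exit"},
    "exit": {},
}


def post_dominators(cfg, exit_node="exit"):
    """For each node n, compute {n} plus all nodes that post-dominate n."""
    nodes = set(cfg)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            successor_sets = [pdom[s] for s in cfg[n].values()]
            new = {n} | set.intersection(*successor_sets)
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def direct_dependents(cfg, pdom, branch, label):
    """Nodes directly control dependent on the edge leaving `branch` via `label`."""
    successor = cfg[branch][label]
    strict_pdom_of_branch = pdom[branch] - {branch}
    return pdom[successor] - strict_pdom_of_branch


def transitive_dependents(cfg, pdom, branch, label):
    """Transitive control successors of one direction of a branch node."""
    result = set()
    worklist = list(direct_dependents(cfg, pdom, branch, label))
    while worklist:
        node = worklist.pop()
        if node in result:
            continue
        result.add(node)
        if len(cfg[node]) > 1:  # only branch nodes control other nodes
            for lab in cfg[node]:
                worklist.extend(direct_dependents(cfg, pdom, node, lab))
    return result


PDOM = post_dominators(CFG)
print(sorted(transitive_dependents(CFG, PDOM, "b1", "T")))
print(sorted(transitive_dependents(CFG, PDOM, "b1", "F")))
```

Besides the statements s1 and s2, the true-branch set of b1 also contains the nested branch node b2, which is what makes s2 reachable transitively; s4 appears in neither set because it post-dominates both branches.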

In  the  next  section,  we  discuss  a  novel  usage  of  control-dependence  information  in  BCE. 

5.2.3.2 Choosing Interesting Branches using Control-Dependence Information

BCE uses control-dependence information (CDI) to annotate the trace tree. If there is at least
one API call in $C_t C^{*}$ (or $C_f C^{*}$) of a branch node, the node is marked as $N_t$ (or $N_f$). Any branch
that has a call to a function that contains at least one $N_t$ or $N_f$ in $C_t C^{*}$ ($C_f C^{*}$) is also marked as
$N_t$ (or $N_f$). BCE only chooses one of the nodes marked with $N_t$ or $N_f$ as a candidate for the new
path. Fig. 5.8 compares an exploration strategy that uses CDI to one that does not. The solid lines
in the figures indicate the paths that have previously been explored. One chooses as the next
candidate one of the nodes (on the solid lines in Fig. 5.8) that has a solid edge to only one child.
Such choices are marked with solid grey arrows. There are fewer candidates to explore in
Fig. 5.8(b) than in Fig. 5.8(a). The degree of the improvement by using CDI depends on the
percentage of nodes marked with $N_t$ or $N_f$. We discuss how the approach works out with real bot
programs in §5.2.7.

    if (a > 0) {          // (b1)
        b = 1;            // (s1)
        if (a < 25) {     // (b2)
            c = 2;        // (s2)
        }
    }
    else {
        d = 3;            // (s3)
    }
    e = 4;                // (s4)

Figure 5.7 An example to show control dependences.

Figure 5.8 Two trace trees: (a) a trace tree without CDI; (b) a trace tree with CDI. The circles represent
branch nodes; the solid arrows represent possible paths to explore; the half-shaded circles represent nodes
labeled as either $N_f$ or $N_t$.

    [1]  char* p1; // input
    [2]  char p2[] = "bot.execute";
    [3]  int v;
    [4]  char c1;
    [5]  do {
    [6]      c1 = *p1++;
    [7]      c2 = *p2++;
    [8]      v = (unsigned)c1 - (unsigned)c2;
    [9]      if (v != 0)
    [10]         break;
    [11] } while (c1 != '\0');
    [12]
    [13] if (v == 0)
    [14]     APICall

Figure 5.9 An example in which it is necessary to choose an alternative candidate as a new path; the
source code of strcmp is inlined in this example.


Algorithm 3 ChooseNewPath
Require: A trace tree T
Ensure: Formula $\varphi$
1: Let Frontier be the branch node in T that is either marked as $N_f$ and does not have a false
   child in T, or marked as $N_t$ and does not have a true child in T, and has the shortest path from
   the root node.
2: Let $\varphi$ be the formula conjoined with all the formulas associated with the branches on the path
   from Frontier back to the root node.
3: Return $\varphi$


Algorithms. Alg. 3 and Alg. 4 describe the path-exploration algorithm of BCE. In Alg. 3, BCE
chooses a node n in the trace tree marked as $N_f$ or $N_t$ whose corresponding branch is not in the
trace tree. BCE then conjoins all the formulas of the branches on the path from n back to the root
node. Alg. 4 takes that formula and calls a constraint solver to obtain a model. If the formula for
the path that BCE chose to explore is feasible, it generates a new concrete state that gets used in
the next round of exploration. Otherwise, it augments the trace tree so that the previously explored
path is never selected again, and calls itself recursively.

Algorithm 4 GenerateNewConcreteState
Require: A trace tree T
Ensure: A concrete state CS'
 1: p = ChooseNewPath(T)
 2: Call the constraint solver with the formula p
 3: if p is feasible then
 4:    Let M be the model from the constraint solver
 5:    Let CS be a random concrete state
 6:    Let CS' be CS updated with all the assignments in M
 7:    Return CS'
 8: else
 9:    Let T' be T augmented with a dummy node at the previously selected node
10:    GenerateNewConcreteState(T')
11: end if
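
The interplay of Alg. 3 and Alg. 4 can be sketched in Python as follows. The sketch rests on several assumptions that are not taken from BCE: branch constraints are modeled as Python predicates over a single integer input, the constraint solver is replaced by a brute-force search over a candidate list, the recursion of Alg. 4 is written as a loop, and a child of None stands for a direction that needs no further exploration (a leaf or a dummy node). The marks "Nt" and "Nf" correspond to the node annotations described above.

```python
from collections import deque


class Node:
    """A branch instance in a simplified trace tree."""

    def __init__(self, predicate, marks=()):
        self.predicate = predicate   # callable: input -> bool
        self.marks = set(marks)      # subset of {"Nt", "Nf"}
        self.children = {}           # outcome -> Node, or None (leaf/dummy)
        self.parent = None           # (parent Node, outcome taken) or None

    def add_child(self, outcome, child):
        self.children[outcome] = child
        if child is not None:
            child.parent = (self, outcome)


def choose_new_path(root):
    """Alg. 3 sketch: shallowest marked node with an unexplored direction."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for mark, outcome in (("Nt", True), ("Nf", False)):
            if mark in node.marks and outcome not in node.children:
                return node, outcome
        queue.extend(c for c in node.children.values() if c is not None)
    return None


def path_formula(node, outcome):
    """Conjoin the edge constraints from the frontier back to the root."""
    constraints = [(node.predicate, outcome)]
    while node.parent is not None:
        node, taken = node.parent
        constraints.append((node.predicate, taken))
    return lambda x: all(pred(x) == want for pred, want in constraints)


def generate_new_input(root, candidates):
    """Alg. 4 sketch with a brute-force stand-in for the constraint solver."""
    while True:
        choice = choose_new_path(root)
        if choice is None:
            return None
        node, outcome = choice
        formula = path_formula(node, outcome)
        for x in candidates:
            if formula(x):
                return x
        node.add_child(outcome, None)  # dummy: never select this direction again


# Trace tree after one run of Fig. 5.5(d) with n <= 0: only the false
# direction of "n > 0" has been explored.
root = Node(lambda n: n > 0, {"Nt"})
root.add_child(False, None)
first_input = generate_new_input(root, range(-50, 50))
print(first_input)
```

After the returned input has been executed and the tree extended with the second branch (n < 25, true direction explored), another call finds no frontier: the false direction of that branch is not marked, because no API call is control dependent on it, so it is never scheduled.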

Fig. 5.10(a) is an example in which the number of possible execution paths is exponential
in the number of branches: the 5 if-statements are independent of one another. For this
code fragment, BCE takes 8 iterations of Alg. 2 when it uses CDI (see footnote 12) to identify 2
different paths (one toward the API call inside the second if-statement, and the other toward the
API call inside the fifth if-statement), whereas without CDI it exhibits exponential behavior.

12 The body of strcmp includes some branches to compare an individual character of the first argument with one
constant character from the second argument. To get to the two API call sites, BCE needs several trials for each.

(a)
    if (strcmp(c[0], "aaa") == 0) {
        n = atoi(c[5]);
    }
    if (strcmp(c[1], "bbb") == 0) {
        APICall1(...);
    }
    if (strcmp(c[2], "ccc") == 0) {
        n = atoi(c[5]);
    }
    if (strcmp(c[3], "ddd") == 0) {
        n = atoi(c[5]);
    }
    if (strcmp(c[4], "eee") == 0) {
        APICall2(...);
    }

(b)
    if (strcmp(c[0], "aaa") == 0) {
        n = atoi(c[5]);
    }
    else if (strcmp(c[1], "bbb") == 0) {
        APICall1(...);
    }
    else if (strcmp(c[2], "ccc") == 0) {
        n = atoi(c[5]);
    }
    else if (strcmp(c[3], "ddd") == 0) {
        n = atoi(c[5]);
    }
    else if (strcmp(c[4], "eee") == 0) {
        APICall2(...);
    }

Figure 5.10 (a) An example with independent if statements (and thus an exponential number of paths).
(b) An example more typical of bot code (with a linear number of paths).

Indirect control-dependence. In some cases, a candidate node marked as $N_t$ or $N_f$ has a
branch predicate whose negation makes the path constraint infeasible, so flipping that branch
does not help program exploration. For example, in Fig. 5.9, p1 points to the input character
array, and p2 points to the constant string "bot.execute". The branch on line 13 is marked as $N_t$
because its true branch contains an API call. Suppose that in the initial concrete state, the first input
byte pointed to by p1 is something different from 'b', and thus the loop in lines 5-11 terminates
at line 9 after one iteration with the condition v != 0, and the false branch of line 13 is executed.
In the subsequent symbolic execution in which the character array pointed to by p1 is treated as a
list of symbols, the path constraint toward the true branch at line 13 is

$(S_{c1} - C_b \neq 0) \land (S_{c1} - C_b = 0)$,

where $S_{c1}$ is a symbol that represents the first input byte, and $C_b$ is a constant symbol. This formula
is infeasible. In such cases, as a heuristic, BCE chooses branches prior to the candidate node on
the trace as an alternative candidate. In this example, the false branch at line 9 is chosen as a new
path so that from the path constraint

$S_{c1} - C_b = 0$,

the constraint solver can provide a new test input in which the first input byte equals 'b'.

When a situation occurs like the one described for line 13, a command-line flag controls how
many prior branches to try.

5.2.3.3  Pruning  the  Trace  Tree  using  Control-Dependence  Information 

CDI  helps  to  direct  program  exploration  toward  API  call  sites.  However,  even  when  some 
candidate  branches  are  excluded  by  CDI,  there  is  still  the  possibility  of  combinatorial  explosion. 
For example, in Fig. 5.10(a), there are 24 paths in total that invoke the API call(s): 8 paths invoke each call (and not the other), and an additional 8 invoke both. When the branches controlled by different commands are independent of each other, multiple commands can be combined to produce different sequences of API calls. In other words, if there are $n$ independent if-statements involved with API calls, the number of possible paths that invoke at least one API call is on the order of $2^n$.
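
The counts quoted for Fig. 5.10(a) can be checked by brute-force enumeration. The following is a minimal sketch, not part of BCE: the function name and the bitmask encoding of paths are assumptions made for illustration. It treats the figure as five independent if-statements, of which the second and fifth guard API calls.

```c
/* Count the paths through `num_ifs` independent if-statements that take the
 * true branch of at least one statement selected by `api_mask` (bit i set
 * means the i-th if-statement guards an API call). A path is encoded as a
 * bitmask of the true branches it takes. Hypothetical helper for
 * illustration; requires num_ifs < 32. */
unsigned count_api_paths(unsigned num_ifs, unsigned api_mask)
{
    unsigned count = 0;
    for (unsigned path = 0; path < (1u << num_ifs); path++) {
        if (path & api_mask)   /* at least one API-guarding branch is taken */
            count++;
    }
    return count;
}
```

With five if-statements and API calls under the second and fifth, `count_api_paths(5, (1u << 1) | (1u << 4))` yields the 24 paths mentioned above (all 32 paths minus the 8 that take neither API branch). Restricting attention to the two API-related statements alone, `count_api_paths(2, 3u)` yields 3, i.e. $2^2 - 1$ combinations of commands.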

To  avoid  such  combinatorial  explosion,  we  limit  the  exploration  performed  by  BCE  so  that 
each  path  includes  a  limited  number  of  API  calls  if  a  candidate  branch  for  extending  the  path  is 
independent  of  the  branches  involved  with  the  API  calls  already  found  in  the  path.  In  particular,  the 
path  exploration  in  BCE  only  finds  n  paths  when  there  are  n  independent  if-statements  involved 
with  API  calls.  The  information  obtained  in  this  way  is  still  useful  to  a  user,  although  it  shifts  the 
burden  onto  the  user  to  identify  the  API-level  behaviors  of  a  bot  by  trying  various  combinations  of 
the  n  extracted  commands.  For  the  example  in  Fig.  5.10(a),  BCE  only  extracts 

“bbb”  for  the  second  token  of  cmd 
“eee”  for  the  fifth  token  of  cmd 

and  the  user  can  try  running  the  bot  with  the  three  kinds  of  inputs — “bbb”,  “eee”,  and  “bbb”  + 
“eee” — to  observe  possibly  different  behaviors. 

Figure 5.11 (a) A control-dependence graph; (b) a trace tree when sub-trees are pruned using control-dependence graph (a); (c) another control-dependence graph; (d) the trace tree when sub-trees are pruned using control-dependence graph (c).

The heuristic for avoiding combinatorial explosion is performed by pruning the trace tree dynamically. The following code illustrates what is involved in dynamically pruning the trace tree. Fig. 5.11(a) is the control-dependence graph of the code, and Fig. 5.11(b) is the corresponding trace tree.

[1] if (strcmp(token[0], "hello") == 0) {
[2]     APICall1(...)
[3]     if (atoi(token[1]) > 0)
[4]
[5]
[6] }

Figure 5.12 A simple example for pruning.



An API call is invoked immediately in the true branch of line 1 in Fig. 5.12. In this case, BCE considers pruning the sub-tree ST of the trace tree starting from line 3. The control-dependence information is used to determine whether the sub-tree ST is to be excluded from further exploration. ST can be excluded if it does not include any node marked as $N_t$ or $N_f$ that is control dependent on line 3 (see Fig. 5.11(b)). If there is at least one other API call in line 4, as shown in Fig. 5.11(c) and (d), the true branch remains as a candidate to explore because the second if-statement is control dependent on the first one.

In practice, many bot programs are written as shown in Fig. 5.10(b), where each if-statement is dependent on other ones. However, even when they are rewritten in the form of Fig. 5.10(a), the pruning technique remains effective.
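
The exclusion test described for the sub-tree ST can be sketched as a recursive check over the trace tree. This is an illustrative sketch only: the node layout and field names are assumptions, the sketch records merely whether a node carries an $N_t$ or $N_f$ mark, and it checks direct control dependence on a single branch line, which may be simpler than what BCE actually does.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical trace-tree node; all field names are assumptions. */
struct trace_node {
    bool marked;                  /* node carries an N_t or N_f mark */
    int  cd_parent_line;          /* line of the branch this node is
                                     control dependent on */
    struct trace_node *true_child;
    struct trace_node *false_child;
};

/* Returns true if the sub-tree rooted at `st` contains no marked node that
 * is control dependent on `branch_line`, i.e. the sub-tree can be excluded
 * from further exploration. An empty sub-tree is trivially prunable. */
bool can_prune(const struct trace_node *st, int branch_line)
{
    if (st == NULL)
        return true;
    if (st->marked && st->cd_parent_line == branch_line)
        return false;
    return can_prune(st->true_child, branch_line)
        && can_prune(st->false_child, branch_line);
}
```

In the situation of Fig. 5.11(a) and (b), no marked node depends on line 3, so the check succeeds and ST is dropped; in the situation of Fig. 5.11(c) and (d), the API call in line 4 produces a marked node that depends on line 3, so the check fails and the true branch stays a candidate.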

5.2.4  Using  Nondeterminism  to  Sidestep  System  Calls 

Many formalisms for symbolic analysis of programs support the use of nondeterminism, which is useful for writing “harness code” (code that models the possible client environments from which the code being analyzed might be called), as well as for modeling the possible inputs to a program. A common approach is to provide a primitive that returns an arbitrary value of a given type. Examples include the SdvMakeChoice primitive of SLAM [46] and the havoc(x) primitive of BoogiePL [48].

In some cases, a value returned from a system call or a Windows-API call is used in a branch condition, as shown in Fig. 5.13. If GetCurrentDirectory returns a value greater than 0, APICall1 is invoked; otherwise, APICall2 is invoked.

for (i = 0; i < 3; i++) {
    int n = GetCurrentDirectory(...);
    if (n > 0) {
        APICall1(...);
    }
    else {
        APICall2(...);
    }
}

Figure 5.13 A simple example for modeling a system call.


In the current version of BCE, concrete execution and symbolic execution do not go into system calls and Windows API functions. Instead, BCE keeps a sequence of random numbers (RandSeq) for concrete execution, and a corresponding sequence of symbols (the symbolic counterpart of RandSeq) for symbolic execution. During concrete execution and symbolic execution, the successive values in RandSeq and in its symbolic counterpart, respectively, are used as the successive return values from API call sites. In the above example, there are three calls to GetCurrentDirectory in a trace because the loop is executed three times. Each of the three return values comes from successive elements of RandSeq and its symbolic counterpart. In this way, we model the state of the operating system. Network inputs are modeled similarly.
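
The concrete half of this scheme can be sketched as follows. The structure and function names are assumptions introduced for illustration (they are not BCE's), and the wrap-around on exhaustion is an arbitrary choice of the sketch. The loop mirrors Fig. 5.13, with the call to GetCurrentDirectory replaced by the model.

```c
#include <stddef.h>

/* A fixed sequence of stand-in return values, consumed one per API call
 * site reached during a run. Hypothetical layout; length must be > 0. */
struct rand_seq {
    const int *values;
    size_t     length;
    size_t     next;      /* index of the next unused element */
};

/* Model of an API call: instead of executing the call, return the next
 * element of the sequence (wrapping around if the sequence runs out). */
int model_api_return(struct rand_seq *seq)
{
    int v = seq->values[seq->next % seq->length];
    seq->next++;
    return v;
}

/* The loop of Fig. 5.13 with GetCurrentDirectory replaced by the model.
 * Returns a bitmask: bit i is set if iteration i would invoke APICall1,
 * and clear if it would invoke APICall2. */
unsigned run_loop(struct rand_seq *seq)
{
    unsigned took_api1 = 0;
    for (int i = 0; i < 3; i++) {
        int n = model_api_return(seq);
        if (n > 0)
            took_api1 |= 1u << i;
    }
    return took_api1;
}
```

Because each call site consumes the next element, a sequence such as {5, 0, 7} drives the three iterations down the APICall1, APICall2, and APICall1 branches in turn. In symbolic execution the same positions would hold fresh symbols rather than numbers, so that the branch conditions on `n` become part of the path constraint.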

5.2.5  Extracting  Type  Information 

§5.2.1 briefly discussed how one can use the information extracted from BCE to understand a bot program and construct proper input commands. This section discusses some additional information that BCE provides to help users understand the recovered information about the botnet’s commands, based on combining the recovered symbolic information about inputs with type information for the target API calls.

Some extracted constant command strings can be directly used to trigger interesting API-level behaviors of a bot program in cases where there are no additional arguments to a command. However, some of the information extracted about a command is in the form of symbolic expressions. A symbolic expression captures the semantics of all the instructions on a specific path from the starting point to the API call site. In some cases, the extracted symbolic expression simply represents a sub-string of the command, whereas there are other cases when the command is converted to another form. A typical action is to convert part of the input string, using the standard library function atoi, into a number that is passed to the API call. In other words, the input string holds numerals, whereas the API call receives a number.

Once BCE extracts a symbolic expression for an argument to an API call, it is the user’s responsibility to choose a proper input with which to run the bot based on the symbolic expression. To help in this step, BCE extracts type information for each symbolic expression using the algorithms shown in Alg. 5 and Alg. 6.

Alg. 5 and Alg. 6 are pseudo-code for collecting type information for each extracted symbolic expression. Our approach uses information about the function prototypes of API calls, as well as a database of OS and network-related types. For example, Fig. 5.14(a) shows the prototype of getaddrinfo and the struct types ADDRINFO and sockaddr_in. ADDRINFO is the type of the third and fourth arguments of getaddrinfo, and sockaddr_in is the type of one of the fields of ADDRINFO.

Algorithm 5 ExtractTypeInformation
Require: A function prototype T
Require: A symbolic state S
Require: The current stack address sp
Ensure: Updated database
1: Let N be the number of arguments of function type T
2: for i = 0 to N - 1 do
3:     Let T_i be the type of the ith argument of function type T
4:     addr_i = sp + i * param_size
5:     CollectTypeInformation(T_i, addr_i)
6: end for


For each API call site, BCE collects type information by calling ExtractTypeInformation (Alg. 5). Along with the function prototype, ExtractTypeInformation takes the symbolic state at the API call site and the symbolic expression that represents the current stack pointer. For example, Fig. 5.14(b) includes a call to the system call getaddrinfo. The first token of the command is converted to a numeric value through atoi to be used as sin_zero for the sockaddr_in object, and the second token is used as ai_canonname for the ADDRINFO object. BCE calls CollectTypeInformation with the actual arguments (ADDRINFO* and the current stack pointer) for the third argument of getaddrinfo.

Algorithm 6 CollectTypeInformation
Require: A type T
Require: An address addr
Require: A symbolic state S
Ensure: Updated database
1: if T is a pointer type T'* then
2:     Let sym_expr be the symbolic expression obtained by looking up addr in S.
3:     Insert the mapping (sym_expr, T'*) into the database
4:     Let addr' be the symbolic expression at address sym_expr in S
5:     if addr' is a scalar then
6:         CollectTypeInformation(T', addr')
7:     end if
8: else if T is a basetype then
9:     Let sym_expr be the symbolic expression obtained by looking up addr in S.
10:     Insert the mapping (sym_expr, T) into the database
11: else if T is a structure type then
12:     for all T_f a field type of T do
13:         CollectTypeInformation(T_f, addr + offset_f)
14:     end for
15: end if

For each argument to the API call, ExtractTypeInformation calculates the address of the corresponding stack location, and passes it to CollectTypeInformation (Alg. 6), along with the argument type from the function prototype and the symbolic state. CollectTypeInformation is a recursive function that tries to associate each type with the corresponding symbolic expression on the stack. Depending on the type, the actions are slightly different:



• In the case of a pointer type T*, BCE first adds the mapping (sym_expr, T*) to the database, looks up the corresponding value in the symbolic state, and recursively calls CollectTypeInformation, passing the value along with the type T of the object referred to. For example, CollectTypeInformation(ADDRINFO*, sp) recursively calls CollectTypeInformation(ADDRINFO, S(sp)).^13

• In the case of a basetype T, BCE looks up the corresponding value (sym_expr) in the symbolic state, and it adds the mapping (sym_expr, T). For example, the first token of the command is used for the field sin_zero of sockaddr_in in Fig. 5.14, which is of base-type unsigned long. In this case, BCE collects the information that the associated symbolic expression is of type unsigned long.

• In the case of a structure type, such as struct or class, BCE iterates over the structure’s fields, calling CollectTypeInformation with each type and the address of the corresponding field. For example, CollectTypeInformation(ADDRINFO, S(sp)) recursively calls CollectTypeInformation(char*, S(sp) + offset_1), CollectTypeInformation(sockaddr_in*, S(sp) + offset_2), and so forth, where offset_k is the corresponding offset for each field.

^13 S(sp) denotes a lookup of sp in symbolic state S.

[1]  int getaddrinfo(
[2]      char*        nodename;
[3]      char*        servname;
[4]      ADDRINFO*    hints;
[5]      ADDRINFO*    res;
[6]  };
[7]  struct {
[8]
[9]      char*        ai_canonname;
[10]     sockaddr_in* ai_addr;
[11]
[12] } ADDRINFO;
[13] struct {
[14]
[15]     unsigned long sin_zero;
[16] } sockaddr_in;

(a)

[1] sockaddr_in* s = ...; // malloc
[2] s->sin_zero = atoi(cmd_token[0]);
[3] ADDRINFO* a = ...; // malloc
[4] a->ai_canonname = ...; // malloc
[5] strcpy(a->ai_canonname, cmd_token[1]);
[6] a->ai_addr = s;
[7] getaddrinfo(..., ..., a, ...);

(b)

Figure 5.14 (a) The prototypes of getaddrinfo, ADDRINFO, and sockaddr_in; (b) an example code fragment.
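
The three cases can be sketched in C as a single recursive function. This is a minimal sketch under stated assumptions, not BCE's implementation: the type descriptors, the flat list of memory cells standing in for the symbolic state, the fixed-size database, and all names are hypothetical. For the pointer case the sketch follows the prose description above (it recurses on the value found at the address, and only when that value is a concrete scalar). It assumes the reachable data is acyclic.

```c
#include <stddef.h>

enum kind { BASE, POINTER, STRUCT };

/* Hypothetical type descriptor. */
struct type {
    enum kind kind;
    const char *name;
    const struct type *pointee;         /* POINTER: type pointed to */
    const struct type *const *fields;   /* STRUCT: field types      */
    const size_t *offsets;              /* STRUCT: field offsets    */
    size_t nfields;
};

/* One entry of the symbolic state: the expression stored at an address.
 * An expression is either a concrete scalar or a symbol id. */
struct cell {
    unsigned long addr;
    int is_scalar;
    unsigned long value;   /* concrete value, or symbol id if !is_scalar */
};

struct state { const struct cell *cells; size_t ncells; };

struct entry { const struct cell *expr; const struct type *type; };
struct database { struct entry entries[32]; size_t n; };

static const struct cell *state_lookup(const struct state *s, unsigned long addr)
{
    for (size_t i = 0; i < s->ncells; i++)
        if (s->cells[i].addr == addr)
            return &s->cells[i];
    return NULL;
}

static void db_insert(struct database *db, const struct cell *e,
                      const struct type *t)
{
    if (db->n < 32) {
        db->entries[db->n].expr = e;
        db->entries[db->n].type = t;
        db->n++;
    }
}

/* Associate type `t` with the expression stored at `addr`, recursing
 * through pointers and structure fields. */
void collect_type_info(const struct type *t, unsigned long addr,
                       const struct state *s, struct database *db)
{
    const struct cell *e;
    switch (t->kind) {
    case POINTER:
        e = state_lookup(s, addr);
        if (e == NULL)
            return;
        db_insert(db, e, t);
        if (e->is_scalar)   /* follow only concrete addresses */
            collect_type_info(t->pointee, e->value, s, db);
        break;
    case BASE:
        e = state_lookup(s, addr);
        if (e != NULL)
            db_insert(db, e, t);
        break;
    case STRUCT:
        for (size_t i = 0; i < t->nfields; i++)
            collect_type_info(t->fields[i], addr + t->offsets[i], s, db);
        break;
    }
}
```

Applied to a state shaped like the example of Fig. 5.14(b), a call with the descriptor for ADDRINFO* and the stack address would record the pointer itself, then each pointer-typed field of the structure, and finally the base-typed expressions they lead to, such as the symbolic expression stored in sin_zero with type unsigned long.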
5.2.6  Implementation

The  BCE  implementation  has  been  structured  so  that  it  can  be  retargeted  to  different  languages 
easily.  The  core  components  of  the  system  are  language-independent  in  two  different  dimensions: 

1.  The  BCE  driver  implements  Alg.  2.  It  is  structured  so  that  one  only  needs  to  provide  an 
implementation  of  concrete  execution  and  symbolic  execution  of  a  language.  Consequently, 
this  component  of  the  system  can  be  used  for  source-level  languages  or  for  machine-code 
languages. 

2. For machine-code languages, we used the TSL-generated primitives for concrete execution and symbolic execution. The TSL-generated symbolic-analysis primitives enable BCE to obtain accurate path constraints. Consequently, unlike SAGE or other tools that use approximation (e.g., tools in which all non-linear operations, such as multiplication, division, and bitwise arithmetic, as well as symbolic dereferences of pointers, are concretized either for efficiency or due to technical difficulty), BCE guarantees no divergences, as discussed in §4.7.

Control-Dependence  Information.  The  control-dependence  information  used  for  the  systematic 
path-exploration  of  BCE  is  collected  from  the  control-dependence  graph  for  a  bot  program.  BCE 
uses  CodeSurfer/x86  [44]  to  obtain  the  control-dependence  graph  for  a  bot  program. 

API Call Prototypes. BCE uses IDA Pro [18] and its Fast Library Identification and Recognition Technology (FLIRT) [9] to identify calls to library functions. It then uses a database of function prototypes and OS and network-related types to extract type information from the recovered symbolic information, as described in §5.2.5.

Library  Functions.  In  BCE,  each  library-function  call  is  replaced  with  a  simplified  model  on 
which  concrete  and  symbolic  execution  are  performed  as  with  other  user  functions. 

5.2.7  Experiments 

We performed experiments on four bot programs. The bots are from different families, and they have different sets of commands. Fig. 5.15 summarizes the experimental results. The table first shows the size of each program in terms of the number of instructions, and the percentage of the branches marked as $N_f$ or $N_t$ for each program.

The four columns listed under “Results” show the number of traces ending with at least one API call, the number of command strings that expect one or more arguments (BCE provides a symbolic expression for such arguments, as discussed in §5.2.5), the total number of iterations performed by BCE,^14 and the average trace length.

^14 An iteration means one run of the basic search step of the BCE algorithm (Alg. 2); on each iteration, a new path is found that leads to a new concrete state.

Bot Program                  | Results                                           | Time
Name     # Instrs.  % Nf/Nt  | # Traces  # SymExprs  # Iterations  Trace Length  | Avg.CE  Total.CE  Avg.SE  Total.SE  Avg.PE  Total.PE  Total
dBot     32168      19%      | 18        7           89            1893          | 2.6     231.4     4.8     427.3     0.9     831.3     1489.9
AgoBot   54641      36%      | 17        8           123           4167          | 7.9     979.1     12.5    1538.7    16.8    2067.6    4585.4
SpyBot   8360       40%      | 31        10          279           1290          | 3.9     1074.2    7.2     2003.2    8.5     2374.3    5451.7
EvilBot  2917       29%      | 17        4           133           2476          | 2.5     333.8     4.4     589.2     2.5     328.5     1251.5

Figure 5.15 BCE experiments. The columns, in order, are: the number of instructions (#Instrs); the percentage of nodes marked as either $N_f$ or $N_t$ in the final trace tree; the number of unique traces ending with at least one API call; the number of commands for which BCE provides symbolic expressions; the total number of iterations to identify the traces; the average trace length; the average time taken for concrete execution; the total time taken for concrete execution; the average time taken for symbolic execution; the total time taken for symbolic execution; the average time taken for path exploration; the total time taken for path exploration; and the total time taken in seconds. The experiments were run on an Intel P4 1.79GHz machine with 1.49GB RAM.

Bot Program | Configuration
Name        | w/ CDI & w/ Pruning  w/o CDI & w/ Pruning  w/ CDI & w/o Pruning  w/o CDI & w/o Pruning
dBot        | 18/89 (20%)          18/101+ (<18%)        18/99+ (<18%)         11/142+ (<8%)
AgoBot      | 17/123 (14%)         17/172+ (<10%)        17/158+ (<11%)        13/167+ (<8%)
SpyBot      | 31/279 (11%)         28/281+ (<10%)        27/420+ (<6%)         25/528+ (<5%)
EvilBot     | 17/133 (13%)         14/206+ (<7%)         17/163+ (<10%)        11/308+ (<4%)

Figure 5.16 BCE experiments. The table reports results for four configurations of BCE: (1) “w/ CDI” and “w/ Pruning”, (2) “w/o CDI” and “w/ Pruning”, (3) “w/ CDI” and “w/o Pruning”, and (4) “w/o CDI” and “w/o Pruning”. The numbers reported in each column are the number of unique traces ending with API call(s), the total number of iterations, and the percentage of iterations that resulted in a trace ending with API calls. The experiments were run on an Intel P4 1.79GHz machine with 1.49GB RAM; the symbol “+” after the number of iterations means that BCE with the configuration did not finish (i.e., program exploration could continue infinitely even if all possible commands had been identified).

For dBot and AgoBot, we had source code, and we were able to compare the extracted commands with the commands that one can obtain from the source code. In the case of AgoBot, there are two commands, “bot.quit” and “bot.die”, that were not identified as bot commands by BCE but are actually commands. This is because they are not involved with any Windows API call; those commands modify some values to change the state of the bot. Even though BCE was able to identify those strings, BCE did not mark them as commands because BCE requires some API call to be controlled by an input string for the string to be classified as a command. Each complete command string, such as “bot.die\0”, is extracted through multiple BCE iterations as follows:

“bot.d”

“bot.di”

“bot.die”

“bot.die\0”

If there is no indication that the extracted string is a command (i.e., it controls no API calls), such as “bot.die”, there needs to be some manual interpretation of BCE’s results, such as whether one should consider an array of bytes in the input that ends with a delimiter (e.g., \0 in the case of strcmp) to be a command.
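
The byte-at-a-time growth of an extracted string can be reproduced with a toy model. The sketch below is only an illustration under loud assumptions: the function names are hypothetical, and the "solver" step simply copies the expected byte from the target string, whereas a directed-test-generation tool would obtain it by solving the path constraint of the failing byte comparison.

```c
#include <stddef.h>

/* Byte-wise comparison in the style of strcmp: returns the index of the
 * first byte at which `input` and `target` differ, or -1 if they are equal
 * up to and including the terminating '\0'. Each byte comparison
 * corresponds to one branch in the trace. */
int first_mismatch(const char *input, const char *target)
{
    for (size_t i = 0; ; i++) {
        if (input[i] != target[i])
            return (int)i;
        if (target[i] == '\0')
            return -1;
    }
}

/* Toy stand-in for the search loop: each iteration "solves" the constraint
 * of the first failing comparison by fixing that one byte, then re-runs.
 * `buf` must provide at least strlen(target) + 1 readable and writable
 * bytes. Returns the number of iterations until the comparison succeeds. */
int extract_by_iterations(char *buf, const char *target)
{
    int iterations = 0;
    int k;
    while ((k = first_mismatch(buf, target)) >= 0) {
        buf[k] = target[k];   /* placeholder for a constraint-solver result */
        iterations++;
    }
    return iterations;
}
```

Starting from an 8-byte buffer filled with a character that does not occur in the command, the model needs one iteration per byte of “bot.die” plus one for the terminating \0, which matches the sequence of partial strings listed above.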

We also performed an experiment to determine how well the two state-space-exploration strategies that we introduced in §5.2.3.2 and §5.2.3.3 perform: one strategy chooses a path that has the possibility of encountering API calls (denoted as “w/ CDI”); the other stops further exploration along the current path once the trace encounters an API call (denoted as “w/ Pruning”).

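The results are shown in Fig. 5.16. We compared the number of traces ending with API calls and the total number of iterations under the configuration “w/ CDI” and “w/ Pruning” with three other configurations: (i) “w/o CDI” and “w/ Pruning”, (ii) “w/ CDI” and “w/o Pruning”, and (iii) “w/o CDI” and “w/o Pruning”. BCE performs best using the configuration “w/ CDI” and “w/ Pruning”: for every bot, it finds at least as many traces as any other configuration, and it does so in fewer iterations (e.g., for dBot, 18 traces in 89 iterations, versus 11 traces in 142+ iterations when neither technique is used).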
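
The percentages in Fig. 5.16 are the ratio of traces to iterations. As a small worked check (the function name is an assumption introduced here):

```c
/* Percentage of iterations that produced a trace ending with an API call.
 * Returns 0 when no iterations were performed. */
double hit_rate_percent(unsigned traces, unsigned iterations)
{
    return iterations == 0 ? 0.0 : 100.0 * traces / iterations;
}
```

For dBot, `hit_rate_percent(18, 89)` is about 20.2, which the table reports as 20%. For the configuration with neither technique, `hit_rate_percent(11, 142)` is about 7.7; because that run did not finish, 142 is only a lower bound on the iterations, so the table reports the rate as an upper bound (<8%).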
One other way in which the four configurations differed is in their ability to report whether all commands had been found. Only the configuration “w/ CDI” and “w/ Pruning” is able to do this; i.e., it exhausted its (pruned) search space and hence could report that there was nothing more to be found. With the other configurations, BCE did not finish even if it had identified all the commands.

As explained in §5.2.3.3, the user must bear in mind that the commands identified are really command fragments, and various combinations of the command fragments must be tried.

5.2.8  Limitations 

BCE  currently  has  the  following  limitations: 

1. Plain (unpacked) binaries. BCE only handles unpacked binaries. In principle, directed test generation is applicable even for packed binaries by invoking a decoder on the fly during concrete execution. However, the current implementation of BCE needs a preprocessing step to obtain control-dependence information, which our implementation obtains from a prebuilt control-flow graph. One would need some heuristics other than control-dependence information as an alternative for avoiding combinatorial explosion.

2. Manual identification of the right starting point. BCE starts its exploration from some command-processing function other than main. This allows relatively short traces for both concrete execution and symbolic execution, resulting in better overall performance of BCE. Typically, there is some initialization code between the beginning of the main function and the command-processing function that is not relevant to extracting input commands. However, this can be problematic if the initialization code affects concrete execution in significant ways. Finding a way to start BCE from the very beginning of a program with low cost is left for future work.

3. Approximation. BCE currently approximates some library function calls by using simplified models. For example, dBot uses snprintf as follows to generate a string in a specific format for the purpose of sending a log to the bot-master:

snprintf(buf, sizeof(buf), "%s %s\r\n", ..., a[x+1]);

where a[x+1] is one of the command tokens. A portion of the command is copied into buf in snprintf. buf is then passed as a parameter to an API call.

If concrete execution and symbolic execution go inside of snprintf, BCE can obtain a symbolic expression for buf that contains symbols from the input command. Instead of doing that, to simplify BCE’s handling of calls to snprintf, we model snprintf as a copy operator so that the input command symbol a[x+1] is copied into the buffer buf, ignoring the format string.

4. Obfuscation on branch conditions. BCE relies on branch conditions to explore a program. Therefore, if the branch conditions are obfuscated by encryption, BCE is prevented from exploring program paths correctly. For example, fragment (a) below is a normal branch condition that checks a byte value against a constant. As proposed by Sharif et al. [168], the code can be obfuscated as shown in fragment (b). Because it is difficult to invert the hash function, it is infeasible to find c given Hc.

if (X == c) {          if (Hash(X) == Hc) {
    B                      run Decrypt(B_E, c)
}                      }
                       // where Hc = Hash(c), B_E = Encrypt(B, c)

(a)                    (b)
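
The difference between fragments (a) and (b) can be made concrete with a small sketch. The hash used here is single-byte FNV-1a, chosen only because it is short to write down; it is a stand-in and an assumption of this sketch, whereas the obfuscation described above relies on a hash that is hard to invert. All function names are hypothetical.

```c
#include <stdint.h>

/* FNV-1a applied to a single byte; a toy stand-in for a real hash. */
uint32_t toy_hash(uint8_t x)
{
    uint32_t h = 2166136261u;   /* FNV offset basis */
    h ^= x;
    h *= 16777619u;             /* FNV prime */
    return h;
}

/* Fragment (a): the constant c appears directly in the branch condition,
 * so the path constraint X == c reveals the expected byte. */
int plain_check(uint8_t x, uint8_t c)
{
    return x == c;
}

/* Fragment (b): only Hc = Hash(c) appears in the code. The path constraint
 * becomes Hash(X) == Hc, which a solver would have to invert. */
int obfuscated_check(uint8_t x, uint32_t hc)
{
    return toy_hash(x) == hc;
}

/* Exhaustive search over the 256 byte values. This works here only because
 * the input is a single byte; the search space grows as 256^k for k bytes.
 * Returns the matching byte, or -1 if none matches. */
int recover_by_search(uint32_t hc)
{
    for (int x = 0; x < 256; x++)
        if (obfuscated_check((uint8_t)x, hc))
            return x;
    return -1;
}
```

In fragment (a) the constant can be read off the constraint, whereas in fragment (b) the only generic way to recover it is a search such as `recover_by_search`, which stops being practical once the hashed input is longer than a few bytes.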

5.2.9  Related  Work 

Machine-Code Analyzers Targeted at Finding Vulnerabilities. §5.1.5 discussed some work on techniques to detect security vulnerabilities by analyzing source code (for a variety of languages). MineSweeper [55], the work of Moser et al. [139], and SAGE have been discussed in §5.2.3.

Dynamic Techniques. Caballero et al. proposed techniques that can be used to extract the format of the protocol messages sent from a bot-master by analyzing bot binaries [58]. They introduced a technique called buffer deconstruction that builds the message field tree of a sent message by analyzing how the output buffer is constructed. Furthermore, they used type-inference-based techniques to find out the type information of each field of the extracted structure by monitoring how the received (or sent) data is used at places where the types are known, such as system calls. Their technique focuses on extracting message formats given proper inputs that trigger malicious actions, whereas BCE aims to extract such proper inputs.

Cho et al. proposed a technique for inferring protocol state machines and applied it to the analysis of botnet Command and Control (C&C) protocols [65]. The inferred protocol state machines can be used for formal analysis for botnet defense, including finding the weakest links in a protocol, uncovering protocol design flaws, inferring the existence of unobservable communication back-channels among botnet servers, etc.

5.2.10  Conclusion 

We developed a tool called BCE that automatically extracts botnet-command information from bot executables, without using source code or symbol-table/debugging information. The information obtained using BCE can be used to build up proper input commands that trigger API-level behaviors. BCE furnishes other kinds of information about a bot’s commands, in particular, information that combines the recovered symbolic information about inputs with type information for the target API calls. BCE also provides a sequence of API calls controlled by each command, which helps users to understand the bot’s API-level behaviors.

BCE performs directed test generation on executables and incorporates a new search technique based on control-dependence information. Our experiments showed that the new search strategies developed for BCE yielded both substantially higher coverage of the parts of the program relevant to identifying bot commands and lower run-time.
Chapter  6
Conclusion

As discussed in Chapter 2, the problem of analyzing executables to recover information about their execution properties has been receiving increased attention, in part because of the WYSINWYX phenomenon. The WYSINWYX phenomenon is due to several drawbacks of source-code analysis and can be addressed only by machine-code level analysis. The approach of working with machine-code exposes the actual instructions that will be executed, and thus works on an artifact that reveals the actual behavior that arises during program execution.

Although establishing execution properties at the machine-code level is a daunting task due to the challenges of machine-code analysis, as discussed in Chapter 2, several research efforts have been made to develop tools and techniques for machine-code analysis. One major effort is CodeSurfer/x86, in whose development I was partly involved. In Chapter 2, we presented the two applications that I developed, FFE/x86 and ConSeq, both of which made use of CodeSurfer/x86.

Unfortunately, although the techniques incorporated into CodeSurfer/x86 are, in principle, language-independent, they were instantiated only for a single instruction set (Intel x86). As already mentioned in Chapter 1, this situation is common in work on program analysis: although the techniques described in the literature are language-independent, analysis implementations are often tied to particular language-specific compiler infrastructure. Unlike the situation in source-code analysis, which can be addressed by developing common intermediate languages, machine-code analysis suffers from the fact that instruction sets typically have hundreds of instructions and a variety of architecture-specific features that are incompatible with other architectures. With future computing platforms based on multicore architectures and transactional memory, future runtime environments using just-in-time compiling, future systems providing cloud computing and autonomic computing, plus cell phones, PDAs, wearable computers, and autonomous vehicles all entering the fray, both (i) security and reliability problems, and (ii) the variety of computing platforms to analyze will only increase.

To  address  these  concerns,  we  developed  improved  techniques  for  analyzing  machine  code — in 
particular,  a  language  called  TSL  (for  “Transformer  Specification  Language”)  for  describing  the 
semantics  of  an  instruction  set,  along  with  a  runtime  system  to  support  the  creation  of  a  multiplicity 
of  static-analysis,  dynamic-analysis,  and  symbolic-analysis  components. 

In addition to the two applications to CodeSurfer/x86 presented in Chapter 2, the main contributions that this dissertation made can be summarized as follows:

•  In  Chapter  3,  we  presented  the  TSL  system  in  detail.  In  the  TSL  system,  analysis  components 
are  generated  from  formal  specifications  of  the  abstract  syntax  and  the  concrete  semantics 
of  an  instruction  set.  TSL  was  presented  from  two  perspectives:  (i)  how  to  write  a  TSL 
specification  from  the  point  of  view  of  instruction-set-specification  developers,  and  (ii)  how 
to  write  TSL  reinterpretations  from  the  point  of  view  of  analysis  developers. 

In §3.2, we presented the techniques used to implement the TSL compiler,
which translates a specification to a common intermediate representation (CIR). The
technical contributions that we made in the design and development of the TSL system can be
summarized as follows:

-  Two-level semantics (along with binding-time analysis): A two-level CIR allows
the precision of an abstract transformer to sometimes be improved, and never made
worse, by interpreting in the concrete semantics those subexpressions that manipulate
only concrete values, which the specification of an instruction set often contains.
This is done by separating the subexpressions that manipulate abstract values in the
abstract semantics from the other subexpressions, whose values can always
be treated as concrete values. To this end, we made use of the existing technique of
binding-time analysis [109].
-  Paired semantics: The TSL system allows easy instantiation of reduced products
[74] by means of paired semantics. One can use the paired-semantics mechanism to
obtain desired multi-phase interactions among TSL-generated analyzers. By creating
a duplicate, but improved, version of CodeSurfer/x86, we demonstrated that this method of CIR
instantiation is useful for performing a form of reduced product when analyses are split
into multiple phases, as in a tool like CodeSurfer/x86.

-  With-normalization and pattern compilation: TSL provides a mechanism for
deconstruction by means of pattern matching. The TSL front-end performs
with-normalization, which transforms all multi-level with-expressions to use only one-level
patterns; an efficient pattern matcher is then generated via the pattern-compilation
algorithm developed by Pettersson [153, 178].

-  Execution over abstract states: An appropriate translation of conditional expressions
and recursive functions makes it possible to create abstract interpreters for an
instruction-set specification: in particular, the code generated for each transformer is able to (i) execute
over abstract states (§3.2.2), (ii) possibly propagate abstract states to more than one
successor in a conditional expression (§3.2.2.1), (iii) compare abstract states and terminate
abstract execution when a fixed point is reached (§3.2.2.2), and (iv) apply widening
operators, if necessary, to ensure termination (§3.2.2.2).

In Chapter 3, we also summarized the applications to which the TSL system has been applied,
including the various static-analysis components generated from the TSL specification of
the IA32 instruction set to develop a new incarnation of CodeSurfer/x86, that is, a revised version
whose analysis components are implemented via TSL. The analogous components for the
PowerPC32 instruction set were generated from a TSL specification of PowerPC32.

We also discussed the leverage that the TSL system provides in §3.4. We showed that
the TSL system provides considerable leverage for implementing analysis tools and
experimenting with new ones. New analyses are easily implemented because a clean interface is
provided for defining an interpretation.
The reinterpretation mechanism allows TSL to be used to implement tool-component
generators and tool generators. Each implementation of an analysis component's driver (e.g.,
fixed-point-finding solver, symbolic executor) serves as the unchanging driver for use in
different instantiations of the analysis component for different instruction sets. The TSL
language becomes the specification language for retargeting that analysis component for
different instruction sets.

Furthermore, for a system like CodeSurfer/x86, which uses multiple analysis phases,
automating the process of creating abstract transformers ensures semantic consistency; that
is, because the analysis implementations are all generated from a single specification of the
instruction set's concrete semantics, a consistent view of the concrete semantics
is guaranteed to be adopted by all of the analysis implementations used in the system.

•  In Chapter 4, we presented a novel way to apply semantic reinterpretation to obtain,
automatically, mutually-consistent, correct-by-construction implementations of symbolic
primitives (in particular, primitives that produce quantifier-free, first-order-logic formulas)
for every instruction set for which one has a TSL specification. The primitives are

-  (a) symbolic evaluation of a single command,

-  (b) WLP (weakest liberal precondition) with respect to a single command, and

-  (c) symbolic composition for a class of formulas that express state transformations.

We also demonstrated that
semantic reinterpretation could create such primitives for languages with pointers, aliasing,
dereferencing, and address arithmetic.

As far as we are aware, the application of semantic reinterpretation to a logic is a new
idea. A related innovation on which our results rest was to define a particular form of
state-transformation formula (structure-update expressions) as a first-class notion in the logic. By
this device, such formulas could (i) serve as a replacement domain in the reinterpretations of
both the programming language's meaning functions and the logic's meaning functions, and
(ii) be reinterpreted themselves.
•  In Chapter 5, we presented two applications, MCVETO and BCE, developed using
TSL-generated analysis components, which use logic-based search procedures to establish
properties of machine-code programs. Compared to work by others on logic-based search
procedures for machine code, what distinguishes the work on MCVETO and BCE is that both
applications are goal-directed. That is, they both have a target property or program point of
interest, and this target is used to focus the search.

-  MCVETO. MCVETO is a tool to check whether a stripped machine-code program
satisfies a safety property. The chapter described how verification of machine code in
MCVETO is performed, and discussed how MCVETO avoids using conventional
techniques from software model checking that would be unsound if applied at the
machine-code level.

MCVETO  is  capable  of  verifying  (or  detecting  flaws  in)  self-modifying  code  (SMC). 
With  SMC  there  is  no  fixed  association  between  an  address  and  the  instruction  at  that 
address,  but  this  is  handled  automatically  by  MCVETO’s  mechanisms  for  abstraction 
refinement.  To  the  best  of  our  knowledge,  MCVETO  is  the  first  model  checker  to  handle 
SMC. 

In  Chapter  5,  we  also  presented  a  language-independent  algorithm  to  identify  the 
aliasing  condition  relevant  to  a  property  in  a  given  state.  Unlike  previous  techniques,  it 
applies  when  static  names  for  variables/objects  are  unavailable. 

We  also  developed  several  techniques  to  enhance  the  methods  used  during  directed 
proof  generation  to  elaborate  the  abstraction  in  use:  the  techniques  enable  exhaustive 
loop  unrolling  to  be  avoided  by  discovering  the  right  loop  invariant.  The  method  in 
which  we  exploit  program  invariants  allows  soundness  to  be  retained  at  all  times  even 
though  the  techniques  we  use  for  obtaining  invariants  are  speculative. 

-  BCE. BCE is a tool for automatically extracting botnet-command information from
bot executables, without using source code or symbol-table/debugging information.
The information obtained using BCE can be used to build up proper input
commands that trigger API-level behaviors. What distinguishes BCE from other existing
symbolic-execution-based test-generation tools is that BCE is goal-directed, using a
new search technique that I developed based on control-dependence information.
Appendix A: User Guide for TSL

Appendix A describes the Transformer Specification Language (TSL). It also contains
information about how to write a TSL specification of the programming language of interest (which we
call the subject language). The TSL system is applicable to both source languages and low-level
machine code. Machine-code languages are used in the examples and descriptions in this manual.

TSL is a strongly typed, first-order functional language with a datatype-definition mechanism
for defining recursive datatypes, plus deconstruction by means of pattern matching. Much of
what a TSL user writes in an instruction-set specification is similar to what one would write in an
interpreter for an instruction set in first-order ML. The user specifies (i) the abstract syntax of an
instruction set, by defining the constructors for a (reserved, but user-defined) type instruction, and
(ii) an execution-state type, by defining type State.
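
To make the analogy with an interpreter concrete, the following is a minimal sketch written in Python rather than TSL. The constructor names ADD, DirectReg32, and Immediate32 follow the examples given later in this appendix; the function names interp_operand and interp_instr, the dictionary used as the register map, and the choice that ADD stores its sum in the first operand are assumptions made purely for illustration.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


@dataclass(frozen=True)
class DirectReg32:
    reg: str          # register name, e.g. "EAX"


@dataclass(frozen=True)
class Immediate32:
    val: int          # 32-bit immediate value


@dataclass(frozen=True)
class ADD:
    op1: object       # destination operand (assumed to be a register)
    op2: object       # source operand


def interp_operand(operand, regs):
    """Value denoted by an operand in the register map `regs`."""
    if isinstance(operand, DirectReg32):
        return regs[operand.reg]
    if isinstance(operand, Immediate32):
        return operand.val & MASK32
    raise TypeError("unknown operand")


def interp_instr(instr, regs):
    """Return the register map that results from executing `instr`."""
    if isinstance(instr, ADD):
        if not isinstance(instr.op1, DirectReg32):
            raise ValueError("destination must be a register in this sketch")
        total = interp_operand(instr.op1, regs) + interp_operand(instr.op2, regs)
        # Build a new map instead of mutating the old one (functional style).
        return {**regs, instr.op1.reg: total & MASK32}
    raise TypeError("unknown instruction")
```

The state is never updated in place: each instruction maps an old state to a new one, which mirrors the first-order functional style described above.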

Lexical Matters. An identifier is a sequence of letters, digits, or underscore characters,
beginning with a letter or an underscore. Upper- and lower-case letters are considered distinct characters.
The following identifiers are reserved and may not be used for other purposes.

true,  false,  with,  default,  let,  in,  phylum,  MAP,  COMMON, 

EXPORT,  UNIQUEREP,  NOWIDEN,  DECLARATIONS, 

FUNCTIONLIST,  EXPORT_FUNCTIONLIST 

Blanks,  tabs,  and  newlines  in  the  specification  file  are  ignored  except  that  they  serve  to  delimit 
tokens.  Comments,  delimited  by  //  and  a  newline,  may  appear  after  any  token. 

TSL Specification. Each TSL specification consists of a list of declarations, which are split into
two parts: a definition of an abstract syntax, given as a set of grammar rules, and a list of functions.
A specification is structured as follows:

    NAME: instruction-set-name

    DECLARATIONS {
        production-declarations
    }

    FUNCTIONLIST {
        function-declarations
    }

    EXPORT_FUNCTIONLIST {
        exported-function-declarations
    }

DECLARATIONS, FUNCTIONLIST, and EXPORT_FUNCTIONLIST blocks can
appear in any order. Each part can be repeated in a specification. DECLARATIONS
contains definitions of user-defined types (production-declarations). FUNCTIONLIST and
EXPORT_FUNCTIONLIST contain user-defined functions (function-declarations) and an
exported-function list (exported-function-declarations), respectively. §A.1, §A.2, and §A.4
describe how to write production, function, and exported-function declarations, respectively.

A.1 Type Definitions (DECLARATIONS)

A.1.1  Phyla,  Operators,  and  Terms 

The core of a specification for a given language is the definition of the language's abstract
syntax, given as a set of grammar rules. The grammar rules are essentially productions of a
regular-tree grammar.

The derivation trees derived from nonterminal symbols are known as terms, and the set of
terms derived from a given nonterminal symbol constitutes a phylum. The grammar should be
viewed as a type-definition mechanism in which the nonterminal symbols are type names and
each nonterminal symbol, taken as a type name, denotes a set of values known as a phylum.
We often refer to nonterminal symbols as phyla, although more precisely they are the names of
phyla. Each production derives terms that can be thought of as n-ary records. The alternatives
of a given nonterminal give rise to different record variants. Terms are used both (i) as abstract
representations of instructions, operands, and other syntactic constructs and (ii) as computational
values. Each production has a name, known as an operator, that can be used in computational
expressions (in different contexts) both as a record constructor and as a selector that discriminates
between variants.

The concepts phylum, operator, and term are defined mutually recursively. A phylum is a set
of terms. A term is the result of applying a k-ary operator to k terms of the appropriate phyla. A
k-ary operator is a constructor function mapping k terms to a term. Operators are typed.

Productions, nonterminal symbols, and operator names are defined simultaneously in phylum
declarations.

Example A.1(a). Let us consider a phylum of binary trees, TREE. Associated with TREE are
two operators: Leaf (of arity 0), and Node (of arity 2, with parameter phyla TREE and TREE).
TREE can be defined inductively as follows:

1)  The term Leaf() is in TREE;

2)  If t1 and t2 are terms in TREE, then the term Node(t1, t2) is in TREE;

3)  No other terms are in TREE.


Phylum TREE is the infinite collection of terms

    {
        Leaf(),
        Node(Leaf(), Leaf()),
        Node(Node(Leaf(), Leaf()), Leaf()),
        Node(Leaf(), Node(Leaf(), Leaf())),
        ...
    }
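
The inductive definition above can be sketched in Python (not TSL) as two record types plus a function that enumerates the terms of TREE up to a given height. The class names mirror the operators Leaf and Node; the helper trees_up_to and the notion of "height" used to bound the enumeration are assumptions introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    """The nullary operator Leaf()."""


@dataclass(frozen=True)
class Node:
    """The binary operator Node(t1, t2); both arguments are TREE terms."""
    left: object
    right: object


def trees_up_to(height):
    """All TREE terms whose height is at most `height` (Leaf() has height 0)."""
    terms = [Leaf()]
    for _ in range(height):
        # Rule 1 contributes Leaf(); rule 2 combines every pair of known terms.
        terms = [Leaf()] + [Node(l, r) for l in terms for r in terms]
    return terms
```

Because rule 2 can always be applied once more, no finite height exhausts the phylum, which is why TREE is infinite.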
A.1.2 Basetypes

Fig. A.1 shows the basetypes that TSL provides. There are two categories of primitive base-types:
unparameterized and parameterized. An unparameterized base-type is just a set of terms. For
example, BOOL is a phylum consisting of truth values, INT32 is a phylum consisting of 32-bit
signed whole numbers, etc. MAP[α, β] is a predefined parameterized phylum, with parameters α
and β. Each of the following is an instance of the parameterized phylum MAP:

    MAP[INT32,INT8]
    MAP[INT32,BOOL]
    MAP[INT32,MAP[INT8,BOOL]]
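
One way to picture a value of a MAP phylum is as a total map: every key has an image, which is a default value unless the map has been updated at that key (compare the "empty map with default value" and "map updated" operations of Fig. A.2). The sketch below is Python, not TSL; the class name TotalMap and the method names lookup and update are hypothetical, and nothing here is taken from the TSL runtime.

```python
class TotalMap:
    """Immutable total map: `default` is the image of every key not updated."""

    def __init__(self, default, entries=None):
        self._default = default
        self._entries = dict(entries or {})

    def lookup(self, key):
        # Keys that were never updated map to the default value.
        return self._entries.get(key, self._default)

    def update(self, key, value):
        # Functional update: the original map is left unchanged.
        new_entries = dict(self._entries)
        new_entries[key] = value
        return TotalMap(self._default, new_entries)
```

For instance, a memory of type MAP[INT32,INT8] could be modeled as TotalMap(0) followed by a chain of update calls, one per byte written.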

TSL provides special syntax for denoting the terms of primitive phyla, often referred to as
constants. For example, the truth values of phylum BOOL are denoted by true and false, the integers
in phylum INT8 are denoted by 0d8, 1d8, 2d8, etc. The syntax of these primitive constants is
summarized in Fig. A.1.


    Phylum      Terms                       Constants
    ---------   -------------------------   ------------------------
    BOOL        false, true                 false, true
    INT64       64-bit signed integers      0d64, 1d64, 2d64, ...
    INT32       32-bit signed integers      0d32, 1d32, 2d32, ...
    INT16       16-bit signed integers      0d16, 1d16, 2d16, ...
    INT8        8-bit signed integers       0d8, 1d8, 2d8, ...
    STR         Sequences of characters.    ""
                All characters except       "ab...AB...01...!%..."
                '\000' permitted.           "\n\r\b\t\f\'\"\\"
                                            "\001\002\003..."
    MAP[α,β]    Maps                        no constants

Figure A.1 Syntax of constants of primitive phyla.
Some primitive values do not have corresponding constant denotations. For example, there is
no TSL constant corresponding to negative one, since -1 is an expression: the negation function
applied to positive one.

§A.3 presents the operators of the TSL base-types.

A.1.3  Syntax 

A  production  declaration  defines  a  new  operator  and  includes  all  terms  constructible  by  that 
operator  in  a  given  phylum.  The  form  of  a  production  declaration  is 

    phylum-name : operator-name ( phylum_1 <identifier_1> ... phylum_k <identifier_k> ) ;

The phylum named by phylum-name is referred to as the left-hand-side phylum; phylum_1, ...,
phylum_k are the parameters of the operator operator-name. A production declares that all terms
constructed by applying the k-ary operator operator-name to argument terms of phyla phylum_1, ...,
phylum_k are members of the left-hand-side phylum. An operator may not be associated with more
than one phylum. Each parameter is associated with a name. The parameter names identifier_1, ...,
identifier_k must be distinct within an operator.

Example A.1(b). The following code snippet shows an example of a definition of abstract-syntax
rules:

    DECLARATIONS {
        reg32: EAX() | EBX();
        operand: DirectReg32(reg32<Reg>)
               | Immediate32(INT32<Val>)
               ;
        instruction: ADD(operand<Op1> operand<Op2>)
               ;
        state: State(MAP[INT32,INT8]<Memory> MAP[reg32,INT32]<Registers>)
               ;
    }
A.1.4 Reserved, but User-Defined Types

Each instruction-set specification must include definitions of the following types:

reg64, reg32, reg16, reg8, cc, instruction, and state

Exported phyla are treated as interfaces between a TSL specification of a subject language and a
client analysis for the language.

Each reserved type is annotated with EXPORT and a binding-directive, either <E> or <R>.
There are two kinds of phyla: concrete phyla and abstract phyla. If a phylum is only used as a
concrete type, such as reg32 and instruction, the phylum is annotated with <E>. If a phylum
is to be used in a reinterpreted semantics, such as state, the phylum is annotated with <R>. The
TSL system generates a common intermediate representation in which phyla annotated with <E>
are converted to concrete types, and the ones annotated with <R> support both concrete types and
reinterpreted versions of those types.

Example A.1(c). Because reg32, instruction, and state are reserved, the code in Example
A.1(b) is amended as follows:

    DECLARATIONS {
        EXPORT reg32<E>: EAX() | EBX();
        operand: DirectReg32(reg32<Reg>)
               | Immediate32(INT32<Val>)
               ;
        EXPORT instruction<E>
               : ADD(operand<Op1> operand<Op2>)
               ;
        EXPORT state<E>: State(MAP[INT32,INT8]<Memory> MAP[reg32,INT32]<Registers>)
               ;
    }
A.1.5 Redefinable Phylum Definitions

TSL allows one to associate base-types (especially parameterized base-types, such as MAP)
with other names. Each phylum defined as reinterpretable can be reinterpreted in a client
application. The form of a reinterpretable-type declaration is

    phylum phylum-name identifier;
    phylum MAP[phylum_1<binding-directive>, phylum_2] identifier;

The binding-directive (<E> | <R>) controls the reinterpretation property of the key type of the
map. The binding-directive <E> is used in the examples in this chapter.¹

Like the unparameterized base-types (e.g., BOOL and INT32), user-defined
reinterpretable types (e.g., MEMMAP32_8 and REGMAP32) are reinterpreted with new types
provided by an analysis developer to create an analysis component.
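
The idea of reinterpreting a type can be sketched outside TSL as follows. In this Python illustration (an assumption-laden sketch, not the TSL mechanism itself), the semantics of an addition instruction is written once against an abstract "domain" parameter; supplying a concrete 32-bit domain yields an ordinary interpreter, while supplying a parity domain yields a simple abstract transformer. The names Concrete32, Parity, and interp_add, and the choice of parity as the abstract domain, are all hypothetical.

```python
class Concrete32:
    """Concrete interpretation of INT32: integers modulo 2**32."""

    @staticmethod
    def plus(a, b):
        return (a + b) & 0xFFFFFFFF


class Parity:
    """Reinterpretation of INT32: 'even', 'odd', or 'top' (unknown)."""

    @staticmethod
    def plus(a, b):
        if a == "top" or b == "top":
            return "top"
        # even+even and odd+odd are even; mixed parities give odd.
        return "even" if a == b else "odd"

    @staticmethod
    def abstract(n):
        return "even" if n % 2 == 0 else "odd"


def interp_add(dom, regs, dst, src):
    """Single specification of 'dst := dst + src', parameterized by `dom`."""
    updated = dict(regs)
    updated[dst] = dom.plus(regs[dst], regs[src])
    return updated
```

Because both instantiations come from the same interp_add, they cannot disagree about what the instruction does; only the meaning of the value type and its operations changes.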

Example A.1(d). The following code is part of a file-system definition. FILESTREAM is defined
as MAP[INT64<E>,INT8]; the key type of FDATA is renamed as inode; and FDATA is defined as
MAP[inode<E>,FILESTREAM].

    DECLARATIONS {
        phylum MAP[INT64<E>,INT8] FILESTREAM;
        phylum INT8 inode;
        phylum MAP[inode<E>,FILESTREAM] FDATA;
    }

Example A.1(e). The code in Example A.1(c) can be rewritten by replacing
MAP[INT32,INT8] and MAP[reg32,INT32] with the redefined names MEMMAP32_8 and
REGMAP32, respectively.

¹ Ordinarily, the key types of maps are <E>. <R> is used in a few circumstances, but certain conditions must
hold for such a reinterpretation to work correctly. The TSL system does not check whether such a reinterpretation
obeys the necessary conditions.
    DECLARATIONS {
        EXPORT reg32<E>: EAX() | EBX();
        operand: DirectReg32(reg32<Reg>)
               | Immediate32(INT32<Val>)
               ;
        EXPORT instruction<E>
               : ADD(operand<Op1> operand<Op2>)
               ;
        phylum MAP[INT32<E>,INT8] MEMMAP32_8;
        phylum MAP[reg32<E>,INT32] REGMAP32;
        EXPORT state<E>: State(MEMMAP32_8<Memory> REGMAP32<Registers>);
    }

A.1.6  Phylum  Directives 

TSL provides two optional directives, COMMON and UNIQUEREP, for phylum declarations.

•  COMMON directive. A phylum can be shared among various languages by annotating the
phylum declarations with the directive COMMON. For example, the phylum definitions for
modeling context-switches are language-independent.

    COMMON phylum MAP[reg32<E>,INT32] SAVEREGS;
    COMMON phylum MAP[cc<E>,BOOL] SAVEFLAGS;
    COMMON context: Context(SAVEREGS<SaveRegs> SAVEFLAGS<SaveFlags>);

•  UNIQUEREP directive. A phylum prefixed with UNIQUEREP is translated into a
type that only allows a single instance to be constructed of any given term. For
example, if QFBVFormula is annotated with UNIQUEREP, there is only one instance for
each term of QFBVFormula, such as QFBV_TRUE() and
QFBV_LT(QFBVSymbol32("Sym1"), QFBVScalar32(0)).

    COMMON UNIQUEREP QFBVFormula
        : QFBV_TRUE() | QFBV_FALSE()
        | QFBV_LT(QFBVTerm32<t1> QFBVTerm32<t2>)
        | QFBV_AND(QFBVFormula<f1> QFBVFormula<f2>)
        | QFBV_OR(QFBVFormula<f1> QFBVFormula<f2>)
        ;

UNIQUEREP cannot precede COMMON.
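
The single-instance property described for UNIQUEREP resembles the technique commonly called hash-consing. The following Python sketch shows that technique under stated assumptions: the class name Formula, its table, and the use of operator-name strings are all hypothetical, and the sketch says nothing about how the TSL system actually implements the directive.

```python
class Formula:
    """Terms with a unique representation: equal terms are the same object."""

    _table = {}   # maps (operator, arguments) to the one existing instance

    def __new__(cls, op, *args):
        key = (op, args)
        existing = cls._table.get(key)
        if existing is None:
            # First time this term is built: create and remember it.
            existing = super().__new__(cls)
            existing.op = op
            existing.args = args
            cls._table[key] = existing
        return existing
```

With such a representation, structural equality of two terms reduces to a pointer comparison, and shared subterms are stored only once.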

A.2  Function  Definitions  (FUNCTIONLIST) 

The form of a function declaration is

    [directives] phylum_0 function-name (
        phylum_1 parameter-name_1,
        phylum_2 parameter-name_2,
        ...,
        phylum_k parameter-name_k
    ) { expression } ;

It declares function-name to be a k-ary function with result phylum phylum_0 that has, for each
i, 1 ≤ i ≤ k, a parameter named parameter-name_i of type phylum_i. The body of the function,
expression, is an expression over parameter-name_1, ..., parameter-name_k that must evaluate to a
term in the result phylum phylum_0.

Function  declarations  are  global  —  they  cannot  be  defined  inside  one  another,  nor  can  they  be 
defined  within  the  scope  of  productions.  Functions  are  not  first-class  objects,  i.e.,  they  cannot  be 
the  value  of  a  parameter  or  an  expression.  Functions  can  be  recursive. 
A.2.1 Function Directives

This section describes function directives, which control how the TSL system translates a
function into the common intermediate representation. TSL supports the following directives
for function definitions:

COMMON, NOWIDEN, and CACHED

•  COMMON. A function can be shared among various languages by annotating the function
declaration with the directive COMMON. The directive COMMON causes the function to be
generated in a common namespace. This directive can be used only for functions that are
language/instruction-set-independent. E.g.,

    COMMON INT32 isSignedOverflowForAddition(INT32 a, INT32 b) {

    };

•  CACHED. The directive CACHED causes TSL to implement function-caching for the function.
E.g.,

    CACHED BOOL Eval_Formula(Formula f, state S) {
        // expression
    };

Here, the return values of the function Eval_Formula for each actual argument pair
<f, S> are cached so that they can be retrieved the next time the function is called with the
same pair of actuals, instead of evaluating the whole function again.

•  NOWIDEN. When a tail-recursive function has a reinterpretable argument type or
reinterpretable return type, the default way of translating the reinterpretable version of the function
in the CIR is to create a function template that will invoke a widening operation to ensure
termination [126]. The directive NOWIDEN causes the TSL compiler to translate the function
to a recursive C++ function that does not perform widening. This directive can be used in
the cases when termination is guaranteed even without widening. E.g.,

    CACHED NOWIDEN BOOL Eval_Formula(Formula f, state S) {
        Formula f1 = f.Arg1();
        Formula f2 = f.Arg2();
        return Eval_Formula(f1, S) && Eval_Formula(f2, S);
    };
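
The trade-off behind NOWIDEN can be illustrated with a small Python sketch of abstract iteration over intervals. This is an illustration of widening in general, not of the code that the TSL compiler emits; the names Interval and iterate, the use_widening flag, and the max_rounds cut-off are assumptions introduced for the example.

```python
from dataclasses import dataclass

INF = float("inf")


@dataclass(frozen=True)
class Interval:
    """Abstract value standing for all integers between lo and hi."""
    lo: float
    hi: float

    def join(self, other):
        # Smallest interval that contains both arguments.
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    def widen(self, newer):
        # Any bound that is still moving is pushed straight to infinity.
        lo = self.lo if newer.lo >= self.lo else -INF
        hi = self.hi if newer.hi <= self.hi else INF
        return Interval(lo, hi)

    def add(self, k):
        return Interval(self.lo + k, self.hi + k)


def iterate(step, init, use_widening=True, max_rounds=1000):
    """Apply `step` repeatedly until the abstract state stops changing."""
    state = init
    for _ in range(max_rounds):
        joined = state.join(step(state))
        nxt = state.widen(joined) if use_widening else joined
        if nxt == state:          # fixed point reached
            return state
        state = nxt
    raise RuntimeError("no fixed point within max_rounds")
```

With widening, a counter that is incremented forever converges in two rounds to the interval from 0 to infinity. When the iteration is known to stabilize on its own, running without widening (the situation NOWIDEN is intended for) can give a more precise answer, at the cost of losing the termination guarantee.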

A.3  Expressions 

Expressions  occur  in  function  declarations.  §A.3.1  discusses  variables  in  expressions.  §A.3.2 
and  §A.3.3  discuss  applications  of  functions  and  operators,  and  basetype  operators,  respectively. 
§A.3.4  presents  conditional  and  binding  expressions. 

A.3.1  Variables 

A  variable  is  a  name  bound  to  a  value.  The  different  lexical  contexts  of  expressions  give  rise 
to  the  distinct  sorts  of  variables  itemized  below. 

Parameters  of  functions.  Each  parameter  of  a  function  is  a  variable  that  denotes  the  value  of  the 
corresponding  argument  passed  to  the  function.  The  type  of  such  a  variable  is  the  one  specified  for 
the  parameter  in  the  function  declaration. 

Pattern variables. Patterns in with-expressions (described in §A.3.4) contain pattern variables.
Pattern matching binds each pattern variable to some term. Each pattern variable p has a scope
within which p is a variable that denotes the term to which it has been bound. The type of a pattern
variable p is determined by the context in which it first occurs in a pattern. This context is either
the i-th argument of some operator g, in which case the type of p is the phylum specified for the
i-th parameter of g, or it is an entire pattern, in which case the type of p is the type of the expression
against which p is being matched.

Let-bound  variables.  Binding  lists  of  let-in-expressions,  as  described  in  §A.3.4,  create  variables 
whose  scope  is  the  expression  that  follows  the  in  keyword. 
A.3.2 Application of functions and operators.

The application of a k-ary function or operator to k arguments of the appropriate phyla is an
expression.

Function  applications.  A  function  application  has  the  form 

    function-name ( expression_1, ..., expression_k )

Assume that function-name has been declared by

    phylum_0 function-name (
        phylum_1 parameter-name_1,
        ...,
        phylum_k parameter-name_k
    ) { expression } ;

and further assume that arguments expression_1, ..., expression_k have values v_1, ..., v_k, respectively.
Then the value of the function application is the value of expression evaluated in an environment
in which parameters parameter-name_1, ..., parameter-name_k are bound to v_1, ..., v_k, respectively.
The types of expression_1, ..., expression_k must be phylum_1, ..., phylum_k, respectively. The type
of the application is phylum_0. If function-name is nullary, an empty pair of parentheses is still
required to indicate function application.
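
The evaluation rule for function applications can be sketched as a tiny evaluator. The Python code below is an illustration only: the tuple encoding of expressions, the FUNCTIONS table, and the two sample functions double and zero are hypothetical and are not part of TSL.

```python
# Each entry: function name -> (parameter names, body expression).
FUNCTIONS = {
    "double": (["x"], ("add", ("var", "x"), ("var", "x"))),
    "zero": ([], ("const", 0)),          # a nullary function
}


def evaluate(expr, env):
    """Evaluate `expr` in environment `env` (a dict from names to values)."""
    tag = expr[0]
    if tag == "const":
        return expr[1]
    if tag == "var":
        return env[expr[1]]
    if tag == "add":
        return evaluate(expr[1], env) + evaluate(expr[2], env)
    if tag == "call":
        params, body = FUNCTIONS[expr[1]]
        # Arguments are evaluated in the caller's environment ...
        values = [evaluate(arg, env) for arg in expr[2]]
        if len(values) != len(params):
            raise TypeError("wrong number of arguments")
        # ... and the body sees only the parameters, bound to those values.
        return evaluate(body, dict(zip(params, values)))
    raise ValueError("unknown expression")
```

Because functions are global and not first-class, looking them up in a single fixed table is enough; no closures are needed.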

Operator  applications.  An  operator  application  has  the  form 

    operator-name ( expression_1, expression_2, ..., expression_k )

Assume the operator has been declared by

    phylum-name
        : operator-name ( phylum_1 <name_1> phylum_2 <name_2> ... phylum_k <name_k> ) ;

and further assume that arguments expression_1, ..., expression_k have values v_1, ..., v_k,
respectively. Then the value of the operator application is the term operator-name(v_1, ..., v_k). The
types of expression_1, ..., expression_k must be phylum_1, ..., phylum_k, respectively. The type of the
application is phylum-name.

A.3.3  Operations  on  primitive  phyla. 

A collection of operations on primitive values is built into TSL. Operations for which special
syntax is provided are summarized in Fig. A.2. Library functions on basetypes are summarized in
Fig. A.3. The two arguments of a binary expression must be expressions of the same type.
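
Several of the operations listed in Figs. A.2 and A.3 distinguish signed from unsigned readings of the same bits. The Python sketch below shows the usual meaning of an unsigned comparison, of zero- and sign-extension from 8 to 32 bits, and of signed overflow on addition. The helper names are hypothetical, and the sketch assumes the conventional two's-complement semantics; it is not derived from the TSL runtime.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(bits):
    """Read the low 32 bits of `bits` as a two's-complement integer."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits


def less_than_signed(a, b):
    return to_signed32(a) < to_signed32(b)


def less_than_unsigned(a, b):
    # Same bits, but compared as non-negative 32-bit quantities.
    return (a & MASK32) < (b & MASK32)


def int8_to_32_ze(i):
    """Zero-extension: the upper 24 bits become 0."""
    return i & 0xFF


def int8_to_32_se(i):
    """Sign-extension: the upper 24 bits copy bit 7."""
    i &= 0xFF
    return to_signed32(i | 0xFFFFFF00) if i & 0x80 else i


def signed_overflow_add32(a, b):
    """True if a + b does not fit in a signed 32-bit integer."""
    total = to_signed32(a) + to_signed32(b)
    return not (-(1 << 31) <= total <= (1 << 31) - 1)
```

For example, the bit pattern of -1 is smaller than 1 under the signed comparison but larger under the unsigned one, which is why both families of comparison operators are provided.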

A.3.4  Conditional  and  binding  expressions. 

Conditional and binding expressions permit the value of an expression to depend on the value
of a constituent subexpression. Three forms are allowed: with-expression, conditional-expression,
and let-expression.

With-expressions. A with-expression is a multi-branch conditional expression that permits
discrimination based on the structure of the value of a given expression. The syntax of a
with-expression is

    with ( identifier ) (
        pattern_1 : expression_1,
        pattern_2 : expression_2,
        ...
        pattern_n : expression_n
    )

The  value  of  identifier  is  called  the  matched  value.  The  value  of  the  with-expression  is  the  value  of 
the  expressioni  corresponding  to  the  first  pattern,  that  matches  the  matched  value.  Each  pattern , 
may  contain  pattern  variables,  which,  if  the  match  succeeds,  are  bound  to  constituents  of  the 
Result            Syntax                     Operation
----------------  -------------------------  ---------------------------------------------
BOOL              b1 && b2                   logical conjunction of b1 and b2
                  b1 || b2                   logical disjunction of b1 and b2
                  b1 ^^ b2                   exclusive logical disjunction of b1 and b2
                  ! b                        logical negation of b
                  random(BOOL)               random boolean value
                  e1 < e2                    e1 less than e2
                  e1 <= e2                   e1 less than or equal to e2
                  e1 > e2                    e1 greater than e2
                  e1 >= e2                   e1 greater than or equal to e2
                  e1 <U e2                   e1 less than (unsigned) e2
                  e1 <=U e2                  e1 less than or equal to (unsigned) e2
                  e1 >U e2                   e1 greater than (unsigned) e2
                  e1 >=U e2                  e1 greater than or equal to (unsigned) e2
                  e1 == e2                   e1 equal to e2
                  e1 != e2                   e1 not equal to e2
INT64, INT32,     i1 * i2                    product of i1 and i2
INT16, INT8       i1 / i2                    quotient of i1 and i2
                  i1 + i2                    sum of i1 and i2
                  i1 - i2                    difference of i1 and i2
                  i1 % i2                    i1 mod i2
                  i1 & i2                    bitwise-and of i1 and i2
                  i1 ^ i2                    bitwise-exclusive-or of i1 and i2
                  i1 | i2                    bitwise-inclusive-or of i1 and i2
                  - i                        negation of i
                  ~ i                        bitwise-complement of i
                  random(α)                  random integer value
MAP[α,β]          [OPAQUE_TYPE | α |-> e]    empty map from α with default value e
                  m[e1 |-> e2]               map m updated so that the image of e1 is e2
                  random(OPAQUE_TYPE)        random map

Figure A.2 Operations on the primitive phyla. (In this table, b's are BOOL parameters, i's are
integer parameters, m's are MAP parameters, e's are parameters of arbitrary type, α and β are
phyla, and OPAQUE_TYPE is a reinterpretable map-type defined in a phylum statement (see
§A.1.5).)


matched value. The value of expression_i is then computed in terms of those bindings. The types of
all expression_i must be the same phylum p; the type of the entire with-expression is that phylum p.
Result       Function(parameters)                        Operation
-----------  ------------------------------------------  ----------------------------------------------
INT32        Int8To32ZE(INT8 i)                          zero-extension of i to 32-bit value
             Int16To32ZE(INT16 i)                        zero-extension of i to 32-bit value
             Int8To32SE(INT8 i)                          sign-extension of i to 32-bit value
             Int16To32SE(INT16 i)                        sign-extension of i to 32-bit value
             Int64To32(INT64 i)                          truncation of i to 32-bit value
             BoolToInt32(BOOL b)                         if b is true, return 1, otherwise 0
             unsignedDiv32(INT32 i1, INT32 i2)           unsigned division of i1 by i2
BOOL         getBit32(INT32 i1, INT32 i2)                get the bit value at the index i2 in the
                                                         32-bit value i1
             signedOverflowAdd32(INT32 i1, INT32 i2)     return true if an overflow occurs in a
                                                         signed addition
             signedOverflowSub32(INT32 i1, INT32 i2)     return true if an overflow occurs in a
                                                         signed subtraction
             unsignedOverflowAdd32(INT32 i1, INT32 i2)   return true if an overflow occurs in an
                                                         unsigned addition
             unsignedOverflowSub32(INT32 i1, INT32 i2)   return true if an overflow occurs in an
                                                         unsigned subtraction
STR          ConcatSTR(STR s1, STR s2)                   concatenation of s1 and s2
             SubSTR(STR s, INT32 i1, INT32 i2)           sub-string of s from index i1 to i2
             INT32toSTR(INT32 i)                         convert 32-bit integer value to a string
MEMMAP32_8   MemAccess_32_8_LE_32(                       32-bit little-endian memory access
               MEMMAP32_8 m, INT32 i)                    addressed by i
             MemUpdate_32_8_LE_32(                       32-bit little-endian memory update
               MEMMAP32_8 m, INT32 i1, INT32 i2)
             MemAccess_32_8_BE_32(                       32-bit big-endian memory access
               MEMMAP32_8 m, INT32 i)                    addressed by i
             MemUpdate_32_8_BE_32(                       32-bit big-endian memory update
               MEMMAP32_8 m, INT32 i1, INT32 i2)

Figure A.3 Library functions on the primitive phyla. In this table, i's are integer parameters, s's
are STR parameters, and b's are BOOL parameters; MEMMAP32_8 is a reinterpretable map-type
whose original type is MAP[INT32,INT8]; m's are MAP-type parameters.
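
To make the intended behavior of a few of these library functions concrete, the following is a minimal C++ sketch of their concrete (non-reinterpreted) meaning. It is an illustration only, not the TSL implementation: the type aliases, the helper readByte, the choice that unmapped memory cells read as 0, and the convention that bit index 0 is the least-significant bit are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-ins for the TSL types INT32 and MEMMAP32_8 (a map from
// 32-bit addresses to 8-bit cells; unmapped cells are assumed to read as 0).
using INT32 = std::int32_t;
using MEMMAP32_8 = std::map<std::uint32_t, std::uint8_t>;

// True if a + b overflows when both operands are read as signed 32-bit values.
bool signedOverflowAdd32(INT32 a, INT32 b) {
    std::int64_t wide = static_cast<std::int64_t>(a) + static_cast<std::int64_t>(b);
    return wide > INT32_MAX || wide < INT32_MIN;
}

// True if a + b carries out of 32 bits when both operands are read as unsigned.
bool unsignedOverflowAdd32(INT32 a, INT32 b) {
    std::uint32_t ua = static_cast<std::uint32_t>(a);
    std::uint32_t ub = static_cast<std::uint32_t>(b);
    return static_cast<std::uint32_t>(ua + ub) < ua;
}

// Bit of `value` at position `index` (assumed: 0 is the least-significant bit).
bool getBit32(INT32 value, INT32 index) {
    return ((static_cast<std::uint32_t>(value) >> (index & 31)) & 1u) != 0;
}

// Helper (hypothetical): read one cell, defaulting to 0 when unmapped.
static std::uint8_t readByte(const MEMMAP32_8& m, std::uint32_t addr) {
    auto it = m.find(addr);
    return it == m.end() ? 0 : it->second;
}

// 32-bit little-endian read: the byte at `addr` is the least significant one.
INT32 MemAccess_32_8_LE_32(const MEMMAP32_8& m, INT32 addr) {
    std::uint32_t a = static_cast<std::uint32_t>(addr);
    std::uint32_t v = 0;
    for (int k = 3; k >= 0; --k) v = (v << 8) | readByte(m, a + k);
    return static_cast<INT32>(v);
}

// 32-bit little-endian write; returns the updated map, leaving the input intact.
MEMMAP32_8 MemUpdate_32_8_LE_32(MEMMAP32_8 m, INT32 addr, INT32 value) {
    std::uint32_t a = static_cast<std::uint32_t>(addr);
    std::uint32_t v = static_cast<std::uint32_t>(value);
    for (int k = 0; k < 4; ++k) m[a + k] = static_cast<std::uint8_t>(v >> (8 * k));
    return m;
}
```

The update function takes and returns the map by value, mirroring the functional style of TSL, in which a state is never modified in place.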


The  patterns  of  a  given  with-expression  must  be  exhaustive,  i.e.,  it  must  be  possible  for  the 
compiler  to  determine  statically  that  for  every  evaluation  of  the  given  with-expression,  one  of  the 
patterns  will  match.  This  will  always  be  the  case  if  one  of  the  patterns  is  *  or  default. 
Patterns are defined inductively, as follows:

1)  Constants  of  primitive  phyla  (TSL  base-types)  are  patterns. 

2)  Pattern  variables  are  patterns.  A  pattern  variable  is  an  identifier. 

3)  Both  the  symbol  *  and  the  keyword  default  are  patterns. 

4) A k-ary operator operator-name applied to k patterns is a pattern:

    operator-name ( pattern_1, ..., pattern_k )

The  same  pattern  variable  may  occur  multiple  times  in  a  pattern.  The  leftmost  occurrence  of  a 
given  pattern  variable  is  its  binding  occurrence  and  all  subsequent  occurrences  in  the  same  pattern 
are  bound  occurrences.  The  type  of  a  pattern  variable  p  is  determined  by  the  context  of  its  binding 
occurrence.  This  context  is  either  the  i-th  argument  of  some  operator  g,  in  which  case  the  type  of  p 
is  the  phylum  specified  for  the  i-th  parameter  of  g,  or  it  is  an  entire  pattern,  in  which  case  the  type 
of  p  is  the  type  of  the  expression  against  which  p  is  being  matched. 

Let  p  be  a  pattern  and  t  be  a  term.  Then  p  is  said  to  match  t  under  the  following  circumstances: 

1)  When  p  is  a  constant  of  a  primitive  phylum  and  t  is  the  same  constant. 

2)  When  p  is  the  binding  occurrence  of  a  pattern  variable  pv,  in  which  case  pv  is  bound  to  t. 

3)  When  p  is  a  bound  occurrence  of  a  pattern  variable  pv  that  has  been  bound  to  some  term  t' 
and  t==t'. 

4)  When  p  is  either  *  or  default. 

5) When p is op(p_1, ..., p_k) and t is op(t_1, ..., t_k) and p_i matches t_i, for all i, 1 <= i <= k.
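
The five matching rules can be sketched as a small recursive matcher. The C++ below is only an illustration under stated assumptions: the Term and Pattern structs, the Bindings map, and the function name matches are hypothetical and are not part of TSL or its compiler. Constants are modeled as terms with no arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A term is an operator (or constant) name applied to zero or more subterms.
struct Term {
    std::string op;
    std::vector<Term> args;
    bool operator==(const Term& o) const { return op == o.op && args == o.args; }
};

struct Pattern {
    enum Kind { Constant, Variable, Wildcard, Operator } kind;
    std::string name;           // constant text, variable name, or operator name
    std::vector<Pattern> args;  // sub-patterns (used by Operator only)
};

using Bindings = std::map<std::string, Term>;

// Returns true if p matches t. Sub-patterns are visited left to right, so the
// leftmost occurrence of a variable is its binding occurrence. On failure,
// env may contain bindings made before the mismatch was found.
bool matches(const Pattern& p, const Term& t, Bindings& env) {
    switch (p.kind) {
    case Pattern::Constant:    // rule 1: same constant
        return t.args.empty() && t.op == p.name;
    case Pattern::Variable: {  // rules 2 and 3
        auto it = env.find(p.name);
        if (it == env.end()) {  // binding occurrence: bind pv to t
            env.emplace(p.name, t);
            return true;
        }
        return it->second == t;  // bound occurrence: require t == t'
    }
    case Pattern::Wildcard:    // rule 4: * or default
        return true;
    case Pattern::Operator:    // rule 5: same operator, argument-wise match
        if (t.op != p.name || t.args.size() != p.args.size()) return false;
        for (std::size_t i = 0; i < p.args.size(); ++i)
            if (!matches(p.args[i], t.args[i], env)) return false;
        return true;
    }
    return false;
}
```

For instance, a pattern Pair(x, x) matches the term Pair(3, 3), binding x to 3, but does not match Pair(3, 4), because the second occurrence of x is a bound occurrence.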

The lexical scope of a pattern variable bound in some pattern_i begins at its binding occurrence
and extends through the corresponding expression_i. The scope of pattern variables is
block-structured, i.e., a given pattern variable may be redeclared in an inner scope.

Example  A.3.4(a).  Consider  the  following  definitions  of  phyla  ENV  and  BINDING: 

    ENV
        : NullEnv()
        | EnvConcat( BINDING ENV )
        ;

    BINDING: Binding( INT32 INT8 );
A value env of phylum ENV is analyzed by the with-expression that is the body of function
lookup: 

    BINDING lookup(INT32 id, ENV env) {
        with (env) (
            NullEnv(): Binding(7d32, 0d8),
            EnvConcat(b, e):
                with (b) (
                    Binding(s, *): id==s ? b : lookup(id, e)
                )
        )
    };

The two operators NullEnv and EnvConcat exhaust all possible alternatives for ENV, so no default
pattern is necessary. If the value of env is NullEnv(), then the pattern NullEnv() matches
it  and  the  value  of  the  with-expression  is  Binding(7d32,  0d8).  Otherwise,  the  value  of  env  is 
necessarily  a  pair  and  the  pattern  EnvConcat(b,  e)  matches  with  pattern  variables  b  and  e  bound 
to  the  first  and  second  components,  respectively.  In  this  case,  the  value  of  the  with-expression  is 
the  value  of  the  inner  with-expression,  wherein  pattern  variables  b  and  e  have  types  BINDING  and 
ENV,  respectively. 

The  same  effect  can  be  obtained  by  combining  the  two  nested  with-expressions  into  one: 

    BINDING lookup(INT32 id, ENV env) {
        with (env) (
            NullEnv(): Binding(7d32, 0d8),
            EnvConcat(Binding(s, v), e):
                id==s ? Binding(s, v) : lookup(id, e)
        )
    };
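
For readers more familiar with C++ than with pattern matching, the same recursion can be sketched as follows. This is an analogy rather than generated code: the struct names, the use of a null pointer to play the role of NullEnv(), and the reading of the constants 7d32 and 0d8 as the values 7 and 0 are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// BINDING: Binding(INT32 INT8)
struct Binding {
    std::int32_t id;
    std::int8_t value;
};

// ENV: NullEnv() | EnvConcat(BINDING ENV). A null pointer stands for NullEnv().
struct Env {
    Binding head;
    std::shared_ptr<const Env> tail;
};
using EnvPtr = std::shared_ptr<const Env>;

// Hypothetical constructor helper corresponding to EnvConcat(b, e).
EnvPtr envConcat(Binding b, EnvPtr e) {
    return std::make_shared<const Env>(Env{b, std::move(e)});
}

// Mirrors the with-expression: one branch per operator of ENV.
Binding lookup(std::int32_t id, const EnvPtr& env) {
    if (!env) return Binding{7, 0};            // NullEnv(): the default binding
    if (env->head.id == id) return env->head;  // EnvConcat(Binding(s, v), e), id == s
    return lookup(id, env->tail);              // otherwise search the rest
}
```

The two if-tests correspond to the two patterns; because ENV has only two operators, no further case is needed, which is the exhaustiveness property the TSL compiler checks statically.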
Conditional-expressions. A more traditional form of conditional expression is available in TSL,
based  not  on  pattern  matching  but  on  the  value  of  a  Boolean  expression.  A  conditional-expression 
has  the  form 

    expression_1 ? expression_2 : expression_3

When expression_1 is an identifier i, it is exactly equivalent to the expression

    with ( i ) ( true : expression_2, false : expression_3 )

Let-expressions.  Let-expressions  are  useful  for  binding  values  to  names.  The  simplest  form  of 
let-expression  is: 

    let id = expression_1 in ( expression_2 )

When  several  values  are  to  be  matched,  a  more  general  form  is  available: 

    let id_1 = expression_1; id_2 = expression_2; ... id_n = expression_n in (
        expression_0
    )

An occurrence of a variable cannot be rebound in subsequent bindings. The last binding in a
let-expression is effective in expression_0. The type of the let-expression is the type of expression_0.
A semicolon before the keyword in is optional.

The value of the general form of let-expression is determined as follows:

Each identifier is bound to the value of the corresponding expression. The value of the
let-expression is the value of expression_0 as computed in an environment containing bindings
for all variables. The expression of each binding is evaluated in the environment that
includes all variables bound by the preceding bindings, or in the initial environment in the case
of the first binding.
A.4 Export Function Definitions (EXPORT_FUNCTIONLIST)

A  function  can  be  exported  to  the  interface  available  to  client  analyses  by  using  a  declaration 
of  the  following  form: 

EXPORT  <cir-directive>  function-name  (<cir-directive> ,  <cir-directive>)  ; 

The  TSL  compiler  only  translates  functions  derivable  from  the  exported  functions. 

Each cir-directive is either <E> or <R>. The return type and parameter types of an exported
function are annotated with either <E> or <R> in an EXPORT_FUNCTIONLIST block. <E>
directs the TSL system to translate the type as non-reinterpretable, whereas <R> causes the type
to be translated as reinterpretable.

Example A.4(a). The concrete semantics of each instruction is specified by defining the function
interpInstr, which takes an instruction and a state, and returns an updated state that captures
the semantics of the instruction.

FUNCTIONLIST  { 

    state interpInstr(instruction I, state S) {


}; 

} 

Example A.4(b). interpInstr in Example A.4(a) can be translated into two versions of CIRs by
including the following export-function declarations:

EXPORT_FUNCTIONLIST {
    EXPORT <E> interpInstr (<E>, <E>);
    EXPORT <R> interpInstr (<E>, <R>);
}

The  first  export  declaration,  in  which  all  the  types  are  declared  as  <E>,  generates  a  component 
that  can  be  used  for  creating  an  emulator  for  the  subject  language.  With  the  second  declaration, 
in which the state types for both the input and the output are <R>, interpInstr is to be
reinterpreted in an alternative semantics:

state# interpInstr(instruction I, state# S) {


}; 


A.4.1  Reserved,  but  User-Defined  Functions  (Exported  Functions) 

Tab. A.1 shows a list of TSL reserved exported functions.² The set of exported functions
specifies the interface between a specification and an analysis client to create an analysis component.
A specification must contain an EXPORT_FUNCTIONLIST block with an export-function
declaration for each of the functions in Tab. A.1.

This section described how to write a concrete semantics of a subject language in TSL from
the point of view of instruction-set-specification (ISS) developers. The TSL compiler automatically
generates from a TSL specification a common intermediate representation (CIR) that can be
instantiated to create multiple analysis components. This chapter presents how the TSL system
generates the CIR (§A.5), as well as how the CIR is instantiated to create an analysis component
(§A.6) from the point of view of analysis developers.

A.5  Common  Intermediate  Representation 

The TSL system automatically generates a CIR from a TSL specification of the concrete
operational semantics of an instruction set. Each generated CIR is specific to a given instruction-set
specification, but common (whence the name CIR) across generated analyses. CIR is a template
class that takes as input a class BT, an abstract domain for an analysis, as shown in Fig. A.5.³
This section describes how the IR that TSL uses internally to represent a TSL specification
(henceforth called TSL-IR) is translated into the output CIR. A specification in TSL is simply linearized,

² The list can vary depending on the client analysis system: the table shows a list of reserved, but user-defined
functions for machine-code instruction sets.

³ CIR is in C++ in reality.
Function Name                                Description
-------------------------------------------  ------------------------------------------------------------
state interpInstr(instruction I, state S)    specifies the concrete operational semantics of instruction I
INT32 GetPC32()                              returns the program counter (PC)
INT32 GetSP32()                              returns the stack pointer (SP)
INT32 AccessPC(state S)                      returns the value of PC in state S
state UpdatePC(state S, INT32 v)             updates the value of PC with v in state S
INT32 AccessSP(state S)                      returns the value of SP in state S
state UpdateSP(state S, INT32 v)             updates the value of SP in state S
INT32 GetEA32(instruction I)                 returns the PC value of instruction I
INT32 GetInstrSize(instruction I)            returns the size of instruction I
INT32 TopOfState32(state S)                  returns value at the address pointed to by SP
state Pop32(state S)                         adjusts SP to pop the top value
state Push32(state S, INT32 v)               pushes the value v to SP

Table A.1 Reserved exported functions; a complete list of reserved export functions can be found
in TSL/instruction_sets/common/exports.tsl


in evaluation order, into a series of C++ statements, in which the names of basetypes,
basetype-operators, and access/update functions are prepended with BT. The user-defined abstract syntax
(lines 3-16 of Fig. A.4) is translated to a set of C++ abstract-domain classes (lines 2-17 of Fig. A.5)
that contain appropriate abstract operators. The user-defined types, such as reg32, operand32, and
instruction, are translated to abstract C++ classes, and the constructors, such as eax, Indirect32,
and add32_32, are subclasses of the parent abstract C++ class. Each user-defined function is
translated to a CIR member function.

Each  TSL  basetype  and  basetype-operator  is  prepended  with  the  template  parameter  name  BT; 
BT  is  supplied  for  each  analysis  by  an  analysis  developer.  The  TSL  basetype-operator  +  on  line  42 
in  Fig.  A.4  is  translated  into  a  static  function  call  on  BT::Plus,  as  shown  on  line  42  in  Fig.  A. 5. 
The with expression and the pattern matching on lines 35-45 of Fig. A.4 are translated to switch
statements in C++⁴ (lines 35-45 in Fig. A.5).
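
The effect of prefixing basetype-operators with the template parameter BT can be seen in a miniature example. The sketch below is not generated CIR code: MiniCIR, ConcreteInterp, and UseInterp are hypothetical names, and only the single operator Plus is modeled. It shows how one translated expression acquires a different meaning for each class supplied as BT.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>

// Hypothetical concrete interpretation: INT32 is a machine integer.
struct ConcreteInterp {
    using INT32 = std::int32_t;
    static INT32 Plus(INT32 a, INT32 b) {
        // Wrap-around addition, computed on unsigned values to avoid
        // signed-overflow undefined behavior.
        return static_cast<INT32>(static_cast<std::uint32_t>(a) +
                                  static_cast<std::uint32_t>(b));
    }
};

// Hypothetical def-use style interpretation: INT32 is the set of names used.
struct UseInterp {
    using INT32 = std::set<std::string>;
    static INT32 Plus(const INT32& a, const INT32& b) {
        INT32 r = a;
        r.insert(b.begin(), b.end());
        return r;
    }
};

// Miniature stand-in for a generated CIR: the TSL expression `dstV + srcV`
// becomes a call to BT::Plus, so its meaning depends on the chosen BT.
template <class BT>
struct MiniCIR {
    static typename BT::INT32 addValues(const typename BT::INT32& dstV,
                                        const typename BT::INT32& srcV) {
        return BT::Plus(dstV, srcV);
    }
};
```

Instantiating MiniCIR with ConcreteInterp yields an emulator-like addition, whereas instantiating it with UseInterp yields the union of the operands' use sets; the body of addValues is written only once.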

A.5.1  Translation  to  Two-Level  Common  Intermediate  Representation 

This section describes a mechanism for improving the precision of generated analyzers by
separating concrete and abstract semantics (à la Nielson and Nielson [149]).

The  concrete  semantics  of  an  instruction  set  often  contains  some  manipulations  of  values  that 
should  always  be  treated  as  concrete  values  (for  every  abstract  interpretation  of  the  instruction). 
For  example,  the  ISS  developer  could  follow  the  approach  taken  in  the  PowerPC  manual  [27] 
and  specify  variants  of  the  conditional  branch  instruction  (BC,  BCA,  BCL,  BCLA)  of  PowerPC  by 
interpreting  one  of  the  fields  in  the  instruction  to  determine  which  of  the  four  variants  is  being 
executed.  In  this  case,  the  precision  of  an  abstract  transformer  could  be  harmed  by  interpreting 
such subexpressions in the abstract semantics. For instance, in a TSL expression v = (b ? 1 : 2),
where b is definitely a concrete value, v can get a precise value (either 1 or 2) when b is concretely
interpreted. However, if b is not expressible precisely in a given abstract domain, the conditional
expression "(b ? 1 : 2)" will be evaluated by joining the two branches and v will not hold a precise
value.
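
The loss of precision can be illustrated with a small C++ sketch. The flat constant-propagation lattice AbsInt, the three-valued AbsBool, and the function names below are assumptions chosen for the example; they are not the domains used by any particular TSL-generated analysis.

```cpp
#include <cassert>
#include <cstdint>

// Flat constant-propagation lattice over 32-bit integers (an assumed domain).
struct AbsInt {
    enum Kind { Bottom, Const, Top } kind;
    std::int32_t value;  // meaningful only when kind == Const
};

AbsInt constant(std::int32_t v) { return AbsInt{AbsInt::Const, v}; }

// Least upper bound: two different constants join to Top.
AbsInt join(AbsInt a, AbsInt b) {
    if (a.kind == AbsInt::Bottom) return b;
    if (b.kind == AbsInt::Bottom) return a;
    if (a.kind == AbsInt::Const && b.kind == AbsInt::Const && a.value == b.value) return a;
    return AbsInt{AbsInt::Top, 0};
}

enum class AbsBool { False, True, Maybe };

// Abstract conditional: an unknown condition forces a join of both branches.
AbsInt condAbstract(AbsBool b, AbsInt t, AbsInt e) {
    if (b == AbsBool::True) return t;
    if (b == AbsBool::False) return e;
    return join(t, e);
}

// Concrete conditional: b is an ordinary bool, so exactly one branch is chosen.
AbsInt condConcrete(bool b, AbsInt t, AbsInt e) { return b ? t : e; }
```

With the condition interpreted abstractly as Maybe, the result of (b ? 1 : 2) is Top; with the condition kept concrete, the result is the constant 1 or the constant 2.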

To address this issue, we perform a kind of binding-time analysis on the TSL-IR, in which the
expressions associated with the manipulation of concrete values in an instruction are annotated
with C, and others with A. Then, we generate a two-level CIR by appending CONC_SEM for C
values, and ABS_SEM for A values. The generated CIR is instantiated for an analysis transformer
by defining ABS_SEM. We provide a predefined concrete semantics for CONC_SEM.

A.6  CIR  Instantiation 

This  section  describes  how  an  analysis  developer  instantiates  the  CIR  to  create  an  analysis 
component. 

⁴ The TSL front end performs with-normalization, which transforms all (multi-level) with expressions to use only
one-level patterns, using the pattern-compilation algorithm from [153, 178].
The generated CIR is instantiated for an analysis by defining (in C++) an interpretation: a
representation class for each TSL basetype, and implementations of each TSL basetype-operator.
Tab. A.2 shows the implementations of primitives for three selected analyses: value-set analysis
(VSA [41]), quantifier-free bit-vector semantics (QFBV), and def-use analysis (DUA). Each
interpretation defines an abstract domain. For example, line 3 of each column defines the
abstract-domain class for INT32: ValueSet32, QFBVTerm32, and UseSet. Each abstract domain is also
required to contain a set of reserved functions, such as join, meet, and widen, which forms an
additional part of the API available to analysis engines that use TSL-generated transformers.

A.6.1  Required  Operators  of  Abstract  Domains  for  a  TSL  Reinterpretation 

Fig. A.6 shows the required operators that an abstract domain must provide for a TSL
reinterpretation. An abstract domain for a map-basetype must provide mapAccess and mapUpdate.
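
As an illustration of what a class satisfying part of this interface might look like, the following C++ sketch implements a simple interval domain. It is a hypothetical example, not one of the domains shipped with TSL: the encoding of bottom as lo > hi, the direction of approximates (the receiver over-approximates its argument), and the use of free functions for join, meet, and widen are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical interval domain with 64-bit bounds; lo > hi encodes bottom.
struct Interval {
    std::int64_t lo, hi;

    static Interval TOP() {
        return {std::numeric_limits<std::int64_t>::min(),
                std::numeric_limits<std::int64_t>::max()};
    }
    static Interval BTM() { return {1, 0}; }

    bool isBottom() const { return lo > hi; }
    bool isTop() const { return lo == TOP().lo && hi == TOP().hi; }
    bool operator==(const Interval& o) const {
        return (isBottom() && o.isBottom()) || (lo == o.lo && hi == o.hi);
    }
    // Assumed meaning: true if every value described by a is described by *this.
    bool approximates(const Interval& a) const {
        return a.isBottom() || (!isBottom() && lo <= a.lo && a.hi <= hi);
    }
};

// Least upper bound: the smallest interval containing both arguments.
Interval join(const Interval& a, const Interval& b) {
    if (a.isBottom()) return b;
    if (b.isBottom()) return a;
    return {std::min(a.lo, b.lo), std::max(a.hi, b.hi)};
}

// Greatest lower bound: the intersection, or bottom if it is empty.
Interval meet(const Interval& a, const Interval& b) {
    if (a.isBottom() || b.isBottom()) return Interval::BTM();
    Interval r{std::max(a.lo, b.lo), std::min(a.hi, b.hi)};
    return r.isBottom() ? Interval::BTM() : r;
}

// Standard interval widening: any bound that grew from a to b jumps to infinity.
Interval widen(const Interval& a, const Interval& b) {
    if (a.isBottom()) return b;
    if (b.isBottom()) return a;
    return {b.lo < a.lo ? Interval::TOP().lo : a.lo,
            b.hi > a.hi ? Interval::TOP().hi : a.hi};
}
```

An analysis engine would call join at control-flow merge points and widen at loop heads to force termination; the remaining operators of Fig. A.6 (setToTop, print, and so on) are omitted here for brevity.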

A.6.2  Paired-Semantics 

Our  system  allows  easy  instantiations  of  reduced  products  [74]  by  means  of  paired  semantics. 
The  TSL  system  provides  a  template  for  paired  semantics  as  shown  in  Fig.  A.7. 

The  CIR  is  instantiated  with  a  paired  semantic  domain  defined  with  two  interpretations, 
INTERP1  and  INTERP2  (each  of  which  may  itself  be  a  paired  semantic  domain),  as  shown  on  line  1 
of Fig. A.8. The communication between interpretations may take place in basetype-operators or
access/update functions; Fig. A.8 is an example of the latter. The two components of the
paired-semantics values are deconstructed on lines 3-6 of Fig. A.8, and the individual INTERP1 and
INTERP2 components from both inputs can be used (as illustrated by the call to interact on line 7
of Fig. A.8) to create the paired-semantics return value, answer. Such overridings of
basetype-operators and access/update functions are done by C++ explicit specialization of members of class
templates (this is specified in C++ by "template<>"; see line 2 of Fig. A.8).

This  method  of  CIR  instantiation  is  also  useful  to  perform  a  form  of  reduced  product  when 
analyses  are  split  into  multiple  phases,  as  in  a  tool  like  CodeSurfer/x86.  CodeSurfer/x86  carries 
out  many  analysis  phases,  and  the  application  of  its  sequence  of  basic  analysis  phases  is  itself 
Table A.2 Parts of the declarations of the basetypes, basetype-operators, and map-access/update
functions for three analyses. (The three columns of the table are listed one after another.)

VSA

    [1]  class VSA_INTERP {
    [2]    // basetypes
    [3]    typedef ValueSet32 INT32;
    [4]
    [5]    // basetype operators
    [6]    INT32 Add(INT32 a, INT32 b)
    [7]    {
    [8]      return a.addValueSet(b);
    [9]    }
    [10]
    [11]   // map-basetypes
    [12]   typedef Dict<reg32, INT32>
    [13]     REGMAP32;
    [14]
    [15]   // map-basetype operators
    [16]   INT32 Access(
    [17]     REGMAP32 m, reg32 k) {
    [18]     return m.Lookup(k);
    [19]   }
    [20]   REGMAP32
    [21]   Update( REGMAP32 m,
    [22]     reg32 k, INT32 v) {
    [23]     return m.Insert(k, v);
    [24]   }
    [25]
    [26] };

QFBV

    [1]  class QFBV_INTERP {
    [2]    // basetype
    [3]    typedef QFBVTerm32 INT32;
    [4]
    [5]    // basetype operators
    [6]    INT32 Add(INT32 a, INT32 b)
    [7]    {
    [8]      return QFBVPlus32(a, b);
    [9]    }
    [10]
    [11]   // map-basetypes
    [12]   typedef Dict<var32, INT32>
    [13]     VAR32MAP;
    [14]
    [15]   // map-basetype operators
    [16]   INT32 Access(
    [17]     REGMAP32 m, reg32 k) {
    [18]     return m.Lookup(k);
    [19]   }
    [20]   REGMAP32
    [21]   Update( REGMAP32 m,
    [22]     reg32 k, INT32 v) {
    [23]     return m.Insert(k, v);
    [24]   }
    [25]
    [26] };

DUA

    [1]  class DUA_INTERP {
    [2]    // basetype
    [3]    typedef UseSet INT32;
    [4]
    [5]    // basetype operators
    [6]    INT32 Add(INT32 a, INT32 b)
    [7]    {
    [8]      return a.Union(b);
    [9]    }
    [10]
    [11]   // map-basetypes
    [12]   typedef KillUseSet VAR32MAP;
    [13]
    [14]   // map-basetype operators
    [15]   INT32 Access(
    [16]     REGMAP32 m, reg32 k) {
    [17]     return UseSet(k);
    [18]   }
    [19]   REGMAP32
    [20]   Update( REGMAP32 m,
    [21]     reg32 k, INT32 v) {
    [22]     REGMAP32 a2 =
    [23]       m.Insert2Kill(k);
    [24]     return a2.Insert2Use(v);
    [25]   }
    [26] };


iterated.  On  each  round,  CodeSurfer/x86  applies  a  sequence  of  analyses:  VSA,  DUA,  and  several 
others.  VSA  is  the  primary  workhorse,  and  it  is  often  desirable  for  the  information  acquired  by 
VSA  to  influence  the  outcomes  of  other  analysis  phases. 
We can use the paired-semantics mechanism to obtain desired multi-phase interactions among
our generated analyzers, typically by pairing the VSA interpretation with another interpretation.
For instance, with DUA_INTERP alone, the information required to get abstract memory location(s)
for addr is lost because the DUA basetype-operators (+ and * on line 3 of Fig. A.9) just return
the union of the arguments' use sets (e.g., see lines 6-9 of the third column of Tab. A.2). With
the pairing of VSA_INTERP with DUA_INTERP (line 1 of Fig. A.8), DUA can use the abstract
address computed for addr2 by VSA_INTERP (line 6 of Fig. A.8), which uses VSA_INTERP::Add
and VSA_INTERP::Mult; the latter operators operate on a numeric abstract domain (rather than a
set-based one).
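
The pairing mechanism and its override by explicit specialization can be sketched in a few lines of self-contained C++. Everything in this sketch is hypothetical and greatly simplified: ConstInterp stands in for a numeric analysis such as VSA, UseSetInterp stands in for DUA, only the operator Add is modeled, and the "mem@" tag is an invented way of showing a fact flowing from the first component to the second.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <utility>

// Stand-in for a numeric interpretation (wrap-around 32-bit addition).
struct ConstInterp {
    using INT32 = std::int32_t;
    static INT32 Add(INT32 a, INT32 b) {
        return static_cast<INT32>(static_cast<std::uint32_t>(a) +
                                  static_cast<std::uint32_t>(b));
    }
};

// Stand-in for a set-based interpretation: track which names are used.
struct UseSetInterp {
    using INT32 = std::set<std::string>;
    static INT32 Add(const INT32& a, const INT32& b) {
        INT32 r = a;
        r.insert(b.begin(), b.end());
        return r;
    }
};

// Generic pairing: each operator is applied componentwise, with no interaction.
template <typename INTERP1, typename INTERP2>
struct PairedSemantics {
    using INT32 = std::pair<typename INTERP1::INT32, typename INTERP2::INT32>;
    static INT32 Add(const INT32& a, const INT32& b) {
        return INT32(INTERP1::Add(a.first, b.first), INTERP2::Add(a.second, b.second));
    }
};

using Paired = PairedSemantics<ConstInterp, UseSetInterp>;

// Explicit specialization of one member for one pairing: the second component
// consults the value computed by the first, as a reduced product would.
template <>
Paired::INT32 PairedSemantics<ConstInterp, UseSetInterp>::Add(const Paired::INT32& a,
                                                              const Paired::INT32& b) {
    ConstInterp::INT32 sum = ConstInterp::Add(a.first, b.first);
    UseSetInterp::INT32 uses = UseSetInterp::Add(a.second, b.second);
    uses.insert("mem@" + std::to_string(sum));  // fact passed from INTERP1 to INTERP2
    return Paired::INT32(sum, uses);
}
```

Pairings that are not specialized keep the componentwise default, so an override has to be written only where two interpretations actually need to exchange information.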
    [1]
    [2]  // User-defined abstract syntax
    [3]  reg32: EAX() | EBX();
    [4]  flag: ZF() | SF();
    [5]  operand32
    [6]    : Indirect32(reg32 INT32)
    [7]    | DirectReg32(reg32)
    [8]    | Immediate32(INT32)
    [9]    ;
    [10] instruction
    [11]   : ADD32_32(operand32 operand32)
    [12]   | MOV32_32(operand32 operand32)
    [13]   ;
    [14] state: State(MAP[INT32,INT8]     // memory
    [15]              MAP[reg32,INT32]    // registers
    [16]              MAP[flag,BOOL]);    // flags
    [17]
    [18] // User-defined functions
    [19] INT32 interpOp32( state S, operand32 I ) {
    [20]   with(S) (
    [21]     State(mem,regs,flags):
    [22]       with(srcOp) (
    [23]         DirectReg32(r): regs(r),
    [24]         Indirect32(base, disp):
    [25]           let b = regs(base);
    [26]               addr = b + disp;
    [27]           ( mem(addr) )
    [28]         Immediate32(i): i
    [29]       )
    [30]   )
    [31] };
    [32] state updateFlag32(state S, ...) {...}
    [33] state updateState32(state S, ...) {...}
    [34] state interpInstr(instruction I, state S) {
    [35]   with(I) (
    [36]     MOV32_32(dstOp, srcOp):
    [37]       let srcVal = interpOp32(S, srcOp);
    [38]       in ( updateState32( S, dstOp, srcVal ) ),
    [39]     ADD32_32(dstOp, srcOp):
    [40]       let dstV = interpOp32(S, dstOp);
    [41]           srcV = interpOp32(S, srcOp);
    [42]           res = dstV + srcV;
    [43]           S2 = updateFlag(S, dstV, srcV, res);
    [44]       in ( updateState32( S2, dstOp, res ) )
    [45]   )
    [46] }
    [47]

Figure A.4 A TSL specification of a simplified IA32 concrete semantics; reserved types and
function names are underlined.

    [1]  template <class BT> class CIR {
    [2]    class reg32 {...};
    [3]    class EAX: public reg32 {...};
    [4]
    [5]    class operand32 {...};
    [6]    class Indirect32: public operand32 {...};
    [7]    ...
    [8]    class instruction {...};
    [9]    class ADD32_32: public instruction {
    [10]     enum TSL_ID id;
    [11]     operand32 op1;
    [12]     operand32 op2;
    [13]
    [14]   };
    [15]   ...
    [16]   class state# {...};
    [17]   class State#: public state# {...};
    [18]   // User-defined functions
    [19]   BT::INT32 interpOp32#( state# S, operand32 I ) {
    [20]     with(S) (
    [21]       State#(mem,regs,flags):
    [22]         with(srcOp) (
    [23]           DirectReg32(r): BT::Access(regs,r),
    [24]           Indirect32(base, disp):
    [25]             let b = BT::Access(regs,base);
    [26]                 addr = BT::Plus(b, disp);
    [27]             ( BT::Access(mem,addr) )
    [28]           Immediate32(i): i
    [29]         )
    [30]     )
    [31]   };
    [32]   state# updateFlag32#(state# S, ...) {...}
    [33]   state# updateState32#(state# S, ...) {...}
    [34]   state# interpInstr#(instruction I, state# S) {
    [35]     with(I) (
    [36]       MOV32_32(dstOp, srcOp):
    [37]         let srcVal = interpOp32#(S, srcOp);
    [38]         in ( updateState32#( S, dstOp, srcVal ) ),
    [39]       ADD32_32(dstOp, srcOp):
    [40]         let dstV = interpOp32#(S, dstOp);
    [41]             srcV = interpOp32#(S, srcOp);
    [42]             res = BT::Plus(dstV, srcV);
    [43]             S2 = updateFlag#(S, dstV, srcV, res);
    [44]         in ( updateState32#( S2, dstOp, res ) )
    [45]     )
    [46]   }
    [47] };

Figure A.5 The CIR generated from Fig. A.4. (The superscript # is used to abbreviate the actual
generated names used in the TSL implementation.)
Basetypes                                     Map-basetypes
--------------------------------------------  ------------------------------------------------
constructors that handle concrete basetypes   constructors that handle concrete map-basetypes
bool approximates(const T & a)                bool approximates(const T & a)
T join(const T & a, const T & b)              T join(const T & a, const T & b)
T widen(const T & a, const T & b)             T widen(const T & a, const T & b)
T meet(const T & a, const T & b)              T meet(const T & a, const T & b)
bool isBottom()                               bool isBottom()
void setToBottom()                            void setToBottom()
static T BTM()                                static T BTM()
bool isTop()                                  bool isTop()
void setToTop()                               void setToTop()
static T TOP()                                static T TOP()
bool operator ==                              bool operator ==
bool operator >
TVL::Bool isEqual(const T & a)                TVL::Bool isEqual(const T & a)
std::ostream & print(std::ostream & o)        std::ostream & print(std::ostream & o)
                                              T mapUpdate(const T & m, const K_T & key)
                                              D_T mapAccess(const T & m)

Figure A.6 Required operators that an abstract domain must provide for a TSL reinterpretation;
T: the abstract domain for a maptype; K_T: the key type of the map-type T; D_T: the datum type
of the map-type T; TVL::Bool: a three value logic (FALSE, ONE, and MAYBE).


    [1] template <typename INTERP1, typename INTERP2>
    [2] class PairedSemantics {
    [3]   typedef PairedBaseType<INTERP1::INT32, INTERP2::INT32> INT32;
    [4]
    [5]   INT32 MemAccess_32_8_LE_32(MEMMAP32_8_LE mem, INT32 addr) {
    [6]     return INT32(INTERP1::MemAccess_32_8_LE_32(mem.GetFirst(), addr.GetFirst()),
    [7]                  INTERP2::MemAccess_32_8_LE_32(mem.GetSecond(), addr.GetSecond()));
    [8]   }
    [9] };

Figure A.7 A part of the template class for paired semantics.
    [1] typedef PairedSemantics<VSA_INTERP, DUA_INTERP> DUA;
    [2] template<> DUA::INT32 DUA::MemAccess_32_8_LE_32(DUA::MEMMAP32_8_LE mem, DUA::INT32 addr) {
    [3]   DUA::INTERP1::MEMMAP32_8_LE memory1 = mem.GetFirst();
    [4]   DUA::INTERP2::MEMMAP32_8_LE memory2 = mem.GetSecond();
    [5]   DUA::INTERP1::INT32 addr1 = addr.GetFirst();
    [6]   DUA::INTERP2::INT32 addr2 = addr.GetSecond();
    [7]   DUA::INT32 answer = interact(mem1, mem2, addr1, addr2);
    [8]   return answer;
    [9] }

Figure A.8 An example of C++ explicit template specialization to create a reduced product.


    [1] with(op) ( ...
    [2]   Indirect32(base, index, scale, disp):
    [3]     let addr = base + index * SignExtend8To32(scale) + disp;
    [4]         m = MemUpdate_32_8_LE_32(mem, addr, v);
    [5] ...)

Figure A.9 A fragment of updateState32.
Appendix B: Semantic-Reinterpretation for Symbolic-Analysis
Primitives

In this appendix, we give correctness proofs for our generated primitives for symbolic
evaluation, WLP, and symbolic composition. These apply to the language PL (§4.2.2) and
reinterpretations given in §4.3; the proofs for MC differ only slightly.

As a notational convenience, we do not distinguish between a State and a LogicalStruct. A
LogicalStruct $\iota$ corresponds to the State: $((\iota{\uparrow}1), (\iota{\uparrow}2)F_\rho)$. Because, for PL, logical structures only
contain the single function $F_\rho$, there is a one-to-one correspondence with states. Hence, whenever
necessary (e.g., in the applications of $\mathcal{E}[\![\cdot]\!]$, $\mathcal{B}[\![\cdot]\!]$, and $\mathcal{I}[\![\cdot]\!]$), we assume that a LogicalStruct $\iota$ is
coerced to $((\iota{\uparrow}1), (\iota{\uparrow}2)F_\rho)$.

B.1 Correctness of the Symbolic-Evaluation Primitive

Lemma B.1 (Relationship of $\mathcal{E}_{SE}$ to $\mathcal{E}$ and $\mathcal{B}_{SE}$ to $\mathcal{B}$)

(1) $\mathcal{T}[\![\mathcal{E}_{SE}[\![E]\!]U]\!]\iota = \mathcal{E}[\![E]\!](\mathcal{U}[\![U]\!]\iota)$

(2) $\mathcal{F}[\![\mathcal{B}_{SE}[\![BE]\!]U]\!]\iota = \mathcal{B}[\![BE]\!](\mathcal{U}[\![U]\!]\iota)$

Proof: The two lemmas are simultaneously proved using structural induction on $E$ and $BE$, as
shown below. Let $U$ be $(\{I_i \mapsfrom T_i\}, \{F_j \mapsfrom FE_j\})$.

Note that the standard interpretations of binop, relop, and boolop coincide with those of $binop_L$,
$relop_L$, and $boolop_L$. Thus, reasoning steps that go from $binop_L(op2_L)$ to $binop(op2)$ are
shorthands for reasoning about each case, such as going from $binop_L(\boxplus)$ to $binop(+)$, etc.

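(1) (i)
\[
\begin{aligned}
\mathcal{T}[\![\mathcal{E}_{SE}[\![c]\!]U]\!]\iota &= \mathcal{T}[\![const(c)]\!]\iota \\
&= \mathcal{T}[\![c]\!]\iota \\
&= const(c) \\
&= \mathcal{E}[\![c]\!](\mathcal{U}[\![U]\!]\iota)
\end{aligned}
\]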
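(ii)
\[
\begin{aligned}
\text{lhs}: \quad \mathcal{T}[\![\mathcal{E}_{SE}[\![I]\!]U]\!]\iota &= \mathcal{T}[\![lookupState\ U\ I]\!]\iota \\
&= \mathcal{T}[\![((U{\uparrow}2)F_\rho)((U{\uparrow}1)I)]\!]\iota \\
\text{rhs}: \quad \mathcal{E}[\![I]\!](\mathcal{U}[\![U]\!]\iota) &= lookupState\ \big((\iota{\uparrow}1)[I_i \mapsto \mathcal{T}[\![(U{\uparrow}1)I_i]\!]\iota],\ (\iota{\uparrow}2)[F_j \mapsto \mathcal{F}[\![(U{\uparrow}2)F_j]\!]\iota]\big)\ I \\
&= \big(((\iota{\uparrow}2)[F_j \mapsto \mathcal{F}[\![(U{\uparrow}2)F_j]\!]\iota])F_\rho\big)\ (\mathcal{T}[\![(U{\uparrow}1)I]\!]\iota) \\
&= (\mathcal{F}[\![(U{\uparrow}2)F_\rho]\!]\iota)\ (\mathcal{T}[\![(U{\uparrow}1)I]\!]\iota) \\
&= access(\mathcal{F}[\![(U{\uparrow}2)F_\rho]\!]\iota,\ \mathcal{T}[\![(U{\uparrow}1)I]\!]\iota) \\
&= \mathcal{T}[\![((U{\uparrow}2)F_\rho)((U{\uparrow}1)I)]\!]\iota
\end{aligned}
\]

(iii)
\[
\begin{aligned}
\text{lhs}: \quad \mathcal{T}[\![\mathcal{E}_{SE}[\![\&I]\!]U]\!]\iota &= \mathcal{T}[\![lookupEnv\ U\ I]\!]\iota = \mathcal{T}[\![(U{\uparrow}1)I]\!]\iota \\
\text{rhs}: \quad \mathcal{E}[\![\&I]\!](\mathcal{U}[\![U]\!]\iota) &= lookupEnv\ \big((\iota{\uparrow}1)[I_i \mapsto \mathcal{T}[\![(U{\uparrow}1)I_i]\!]\iota],\ (\iota{\uparrow}2)[F_j \mapsto \mathcal{F}[\![(U{\uparrow}2)F_j]\!]\iota]\big)\ I \\
&= \big((\iota{\uparrow}1)[I_i \mapsto \mathcal{T}[\![(U{\uparrow}1)I_i]\!]\iota]\big)\ I \\
&= \mathcal{T}[\![(U{\uparrow}1)I]\!]\iota
\end{aligned}
\]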
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O) 

lhs  : 


=nsi*Ejuji, 


=e[*m 


=  lookupStore 


=  T\lookupStore  U  (£[#]]  C/)]]t 
=  Tim2)Fp)(£lEjU)ji 
rhs  :  SI*E}(UIU}l) 

(42)[F^  TEliU^Fjii) 

\tnrn  ^  nmmH 

(42 )[Fj  ~  ^8[(U^2)FjlL)(S[El(UlUli))i 

=  (E£\(U]2)Fp\l)  (SIE}(UIU}l)) 

=  {E£{{U]2)Fp\l)  (Tumult)  //  by  ind.  via  (1) 

=  access^ £\{U]2)Fp\p  T|£[£]L/]i) 

=  nm2)Fp)(£lEjU)jt 

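(v)
\begin{align*}
\mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket E_1\ op2\ E_2 \rrbracket U \rrbracket\iota
  &= \mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket E_1 \rrbracket U\ \ op2_L\ \ \mathcal{E}_{SE}\llbracket E_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket E_1 \rrbracket U \rrbracket\iota\ \ \mathit{binop}_L(op2_L)\ \ \mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket E_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{E}\llbracket E_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)\ \ \mathit{binop}(op2)\ \ \mathcal{E}\llbracket E_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) && \text{// by ind. via (1)} \\
  &= \mathcal{E}\llbracket E_1\ op2\ E_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}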


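(vi)
\begin{align*}
\mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket BE\ ?\ E_1 : E_2 \rrbracket U \rrbracket\iota
  &= \mathcal{T}\llbracket \mathit{ite}(\mathcal{B}_{SE}\llbracket BE \rrbracket U,\ \mathcal{E}_{SE}\llbracket E_1 \rrbracket U,\ \mathcal{E}_{SE}\llbracket E_2 \rrbracket U) \rrbracket\iota \\
  &= \mathit{cond}_L(\mathcal{F}\llbracket \mathcal{B}_{SE}\llbracket BE \rrbracket U \rrbracket\iota,\ \mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket E_1 \rrbracket U \rrbracket\iota,\ \mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket E_2 \rrbracket U \rrbracket\iota) \\
  &= \mathcal{F}\llbracket \mathcal{B}_{SE}\llbracket BE \rrbracket U \rrbracket\iota\ ?\ \mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket E_1 \rrbracket U \rrbracket\iota : \mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket E_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{B}\llbracket BE \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)\ ?\ \mathcal{E}\llbracket E_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) : \mathcal{E}\llbracket E_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) && \text{// by ind. via (1) and (2)} \\
  &= \mathcal{E}\llbracket BE\ ?\ E_1 : E_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}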

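(2) (i) $\mathcal{F}\llbracket \mathcal{B}_{SE}\llbracket \mathrm{T} \rrbracket U \rrbracket\iota = \mathcal{F}\llbracket \mathrm{T} \rrbracket\iota = \mathrm{T} = \mathcal{B}\llbracket \mathrm{T} \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)$

(ii) $\mathcal{F}\llbracket \mathcal{B}_{SE}\llbracket \mathrm{F} \rrbracket U \rrbracket\iota = \mathcal{F}\llbracket \mathrm{F} \rrbracket\iota = \mathrm{F} = \mathcal{B}\llbracket \mathrm{F} \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)$

(iii)
\begin{align*}
\mathcal{F}\llbracket \mathcal{B}_{SE}\llbracket E_1\ rop\ E_2 \rrbracket U \rrbracket\iota
  &= \mathcal{F}\llbracket \mathcal{E}_{SE}\llbracket E_1 \rrbracket U\ \ rop_L\ \ \mathcal{E}_{SE}\llbracket E_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket E_1 \rrbracket U \rrbracket\iota\ \ \mathit{relop}_L(rop_L)\ \ \mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket E_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{E}\llbracket E_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)\ \ \mathit{relop}(rop)\ \ \mathcal{E}\llbracket E_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) && \text{// by ind. via (1)} \\
  &= \mathcal{B}\llbracket E_1\ rop\ E_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}

(iv)
\begin{align*}
\mathcal{F}\llbracket \mathcal{B}_{SE}\llbracket \neg BE_1 \rrbracket U \rrbracket\iota
  &= \mathcal{F}\llbracket \neg\,\mathcal{B}_{SE}\llbracket BE_1 \rrbracket U \rrbracket\iota \\
  &= \neg\,\mathcal{F}\llbracket \mathcal{B}_{SE}\llbracket BE_1 \rrbracket U \rrbracket\iota \\
  &= \neg\,\mathcal{B}\llbracket BE_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) && \text{// by ind. via (2)} \\
  &= \mathcal{B}\llbracket \neg BE_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}

(v)
\begin{align*}
\mathcal{F}\llbracket \mathcal{B}_{SE}\llbracket BE_1\ bop\ BE_2 \rrbracket U \rrbracket\iota
  &= \mathcal{F}\llbracket \mathcal{B}_{SE}\llbracket BE_1 \rrbracket U\ \ bop_L\ \ \mathcal{B}_{SE}\llbracket BE_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{F}\llbracket \mathcal{B}_{SE}\llbracket BE_1 \rrbracket U \rrbracket\iota\ \ \mathit{boolop}_L(bop_L)\ \ \mathcal{F}\llbracket \mathcal{B}_{SE}\llbracket BE_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{B}\llbracket BE_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)\ \ \mathit{boolop}(bop)\ \ \mathcal{B}\llbracket BE_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) && \text{// by ind. via (2)} \\
  &= \mathcal{B}\llbracket BE_1\ bop\ BE_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}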

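Theorem 4.4 For all $s \in \mathit{Stmt}$, $U \in \mathit{StructUpdate}$, and $\iota \in \mathit{LogicalStruct}$, the meaning of $\mathcal{I}_{SE}\llbracket s \rrbracket U$ in $\iota$ (i.e., $\mathcal{U}\llbracket \mathcal{I}_{SE}\llbracket s \rrbracket U \rrbracket\iota$) is equivalent to running $\mathcal{I}$ on $s$ with an input state obtained from $\mathcal{U}\llbracket U \rrbracket\iota$. That is,

\[
\mathcal{U}\llbracket \mathcal{I}_{SE}\llbracket s \rrbracket U \rrbracket\iota = \mathcal{I}\llbracket s \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota).
\]

□

Proof:

(i)
\begin{align*}
\mathcal{U}\llbracket \mathcal{I}_{SE}\llbracket I = E; \rrbracket U \rrbracket\iota
  &= \mathcal{U}\llbracket \mathit{updateStore}_{SE}\ U\ (\mathit{lookupEnv}_{SE}\ U\ I)\ (\mathcal{E}_{SE}\llbracket E \rrbracket U) \rrbracket\iota \\
  &= \mathcal{U}\llbracket \mathit{updateStore}_{SE}\ U\ ((U{\uparrow}1)I)\ (\mathcal{E}_{SE}\llbracket E \rrbracket U) \rrbracket\iota \\
  &= \mathcal{U}\llbracket \bigl((U{\uparrow}1),\ (U{\uparrow}2)[F_\rho \mapsfrom ((U{\uparrow}2)F_\rho)[(U{\uparrow}1)I \mapsto \mathcal{E}_{SE}\llbracket E \rrbracket U]]\bigr) \rrbracket\iota \\
  &= \bigl((\mathcal{U}\llbracket U \rrbracket\iota){\uparrow}1,\ ((\mathcal{U}\llbracket U \rrbracket\iota){\uparrow}2)[\mathcal{T}\llbracket (U{\uparrow}1)I \rrbracket\iota \mapsto \mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket E \rrbracket U \rrbracket\iota]\bigr) \\
  &= \bigl((\mathcal{U}\llbracket U \rrbracket\iota){\uparrow}1,\ ((\mathcal{U}\llbracket U \rrbracket\iota){\uparrow}2)[((\mathcal{U}\llbracket U \rrbracket\iota){\uparrow}1)I \mapsto \mathcal{E}\llbracket E \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)]\bigr) && \text{// by Lem. B.1(1)} \\
  &= \mathit{updateStore}\ (\mathcal{U}\llbracket U \rrbracket\iota)\ (\mathit{lookupEnv}\ (\mathcal{U}\llbracket U \rrbracket\iota)\ I)\ (\mathcal{E}\llbracket E \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)) \\
  &= \mathcal{I}\llbracket I = E; \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}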

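(ii)
\begin{align*}
\mathcal{U}\llbracket \mathcal{I}_{SE}\llbracket {*I} = E; \rrbracket U \rrbracket\iota
  &= \mathcal{U}\llbracket \mathit{updateStore}_{SE}\ U\ (\mathcal{E}_{SE}\llbracket I \rrbracket U)\ (\mathcal{E}_{SE}\llbracket E \rrbracket U) \rrbracket\iota \\
  &= \mathcal{U}\llbracket \bigl((U{\uparrow}1),\ (U{\uparrow}2)[F_\rho \mapsfrom ((U{\uparrow}2)F_\rho)[\mathcal{E}_{SE}\llbracket I \rrbracket U \mapsto \mathcal{E}_{SE}\llbracket E \rrbracket U]]\bigr) \rrbracket\iota \\
  &= \bigl((\mathcal{U}\llbracket U \rrbracket\iota){\uparrow}1,\ ((\mathcal{U}\llbracket U \rrbracket\iota){\uparrow}2)[\mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket I \rrbracket U \rrbracket\iota \mapsto \mathcal{T}\llbracket \mathcal{E}_{SE}\llbracket E \rrbracket U \rrbracket\iota]\bigr) \\
  &= \bigl((\mathcal{U}\llbracket U \rrbracket\iota){\uparrow}1,\ ((\mathcal{U}\llbracket U \rrbracket\iota){\uparrow}2)[\mathcal{E}\llbracket I \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) \mapsto \mathcal{E}\llbracket E \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)]\bigr) && \text{// by Lem. B.1(1)} \\
  &= \mathit{updateStore}\ (\mathcal{U}\llbracket U \rrbracket\iota)\ (\mathcal{E}\llbracket I \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota))\ (\mathcal{E}\llbracket E \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)) \\
  &= \mathcal{I}\llbracket {*I} = E; \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}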




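(iii)
\begin{align*}
\mathcal{U}\llbracket \mathcal{I}_{SE}\llbracket S_1\ S_2 \rrbracket U \rrbracket\iota
  &= \mathcal{U}\llbracket \mathcal{I}_{SE}\llbracket S_2 \rrbracket(\mathcal{I}_{SE}\llbracket S_1 \rrbracket U) \rrbracket\iota \\
  &= \mathcal{I}\llbracket S_2 \rrbracket(\mathcal{U}\llbracket \mathcal{I}_{SE}\llbracket S_1 \rrbracket U \rrbracket\iota) && \text{// by induction} \\
  &= \mathcal{I}\llbracket S_2 \rrbracket(\mathcal{I}\llbracket S_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)) && \text{// by induction} \\
  &= \mathcal{I}\llbracket S_1\ S_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}
□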
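
The statement of Theorem 4.4 can be checked on a concrete instance. The following is a minimal sketch, not the generated primitive itself: it assumes a cut-down PL with only constants, identifiers, `&I`, `*E`, and `+`, and represents syntax trees as Python tuples. All function and variable names (`T`, `FE`, `U_meaning`, `E`, `E_SE`, `I`, `I_SE`, `iota`, `prog`, `U_id`) are hypothetical, and unmapped addresses are assumed to hold 0.

```python
# Terms of the logic:   ("const", c) | ("id", I) | ("add", t1, t2) | ("app", fe, t)
# Function expressions: ("F",) | ("upd", fe, t1, t2)        (("F",) plays the role of F_rho)
# PL expressions:       ("const", c) | ("var", I) | ("addr", I) | ("deref", e) | ("add", e1, e2)
# PL statements:        ("assign", I, e) for "I = e;"   and   ("store", I, e) for "*I = e;"

def T(t, iota):
    """Meaning of a term in the logical structure iota = (ids, f_rho)."""
    if t[0] == "const":
        return t[1]
    if t[0] == "id":
        return iota[0][t[1]]
    if t[0] == "add":
        return T(t[1], iota) + T(t[2], iota)
    return FE(t[1], iota).get(T(t[2], iota), 0)          # ("app", fe, t)

def FE(fe, iota):
    """Meaning of a function expression, as a fresh dict from addresses to values."""
    if fe[0] == "F":
        return dict(iota[1])
    f = FE(fe[1], iota)                                  # ("upd", fe, t1, t2)
    f[T(fe[2], iota)] = T(fe[3], iota)
    return f

def U_meaning(U, iota):
    """Meaning of a structure update U = (ids, fe): a new (env, store) pair."""
    ids, fe = U
    return ({i: T(t, iota) for i, t in ids.items()}, FE(fe, iota))

def E(e, sigma):
    """Standard (concrete) evaluation of a PL expression in state sigma = (env, store)."""
    env, store = sigma
    if e[0] == "const":
        return e[1]
    if e[0] == "var":
        return store.get(env[e[1]], 0)
    if e[0] == "addr":
        return env[e[1]]
    if e[0] == "deref":
        return store.get(E(e[1], sigma), 0)
    return E(e[1], sigma) + E(e[2], sigma)               # ("add", e1, e2)

def E_SE(e, U):
    """Symbolic evaluation of a PL expression: returns a term over the pre-state."""
    ids, fe = U
    if e[0] == "const":
        return e
    if e[0] == "var":
        return ("app", fe, ids[e[1]])
    if e[0] == "addr":
        return ids[e[1]]
    if e[0] == "deref":
        return ("app", fe, E_SE(e[1], U))
    return ("add", E_SE(e[1], U), E_SE(e[2], U))

def I(stmts, sigma):
    """Standard execution of a statement list."""
    for kind, ident, e in stmts:
        env, store = sigma
        addr = env[ident] if kind == "assign" else E(("var", ident), sigma)
        sigma = (env, {**store, addr: E(e, sigma)})
    return sigma

def I_SE(stmts, U):
    """Symbolic execution of a statement list: returns a structure update."""
    for kind, ident, e in stmts:
        ids, fe = U
        target = ids[ident] if kind == "assign" else E_SE(("var", ident), U)
        U = (ids, ("upd", fe, target, E_SE(e, U)))
    return U

# x lives at address 100 and holds 7; p lives at 104 and points to x.
iota = ({"x": 100, "p": 104}, {100: 7, 104: 100})
U_id = ({"x": ("id", "x"), "p": ("id", "p")}, ("F",))
# x = x + 1;  *p = *p + x;
prog = [("assign", "x", ("add", ("var", "x"), ("const", 1))),
        ("store", "p", ("add", ("deref", ("var", "p")), ("var", "x")))]

# Theorem 4.4 on this instance: meaning of the symbolic result == concrete run.
assert U_meaning(I_SE(prog, U_id), iota) == I(prog, iota)
```

In this instance both sides yield the store `{100: 16, 104: 100}`. A single passing instance illustrates the theorem but is of course no substitute for the inductive proof.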

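B.2 Correctness of WLP

Lemma B.2 (Relationship of $\mathcal{T}_{WLP}$ to $\mathcal{T}$, $\mathcal{F}_{WLP}$ to $\mathcal{F}$, and $\mathcal{FE}_{WLP}$ to $\mathcal{FE}$)

(1) $\mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T \rrbracket U \rrbracket\iota = \mathcal{T}\llbracket T \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)$

(2) $\mathcal{F}\llbracket \mathcal{F}_{WLP}\llbracket \varphi \rrbracket U \rrbracket\iota = \mathcal{F}\llbracket \varphi \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)$

(3) $\mathcal{FE}\llbracket \mathcal{FE}_{WLP}\llbracket FE \rrbracket U \rrbracket\iota = \mathcal{FE}\llbracket FE \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)$

Proof: The three lemmas are proved simultaneously by structural induction on $T$, $\varphi$, and $FE$, as shown below. Let $U$ be $(\{I_i \mapsfrom T_i\},\ \{F_j \mapsfrom FE_j\})$. (Thus, $T_i = (U{\uparrow}1)I_i$ and $FE_j = (U{\uparrow}2)F_j$.) Let $f$ be $(\iota{\uparrow}2)[F_j \mapsto \mathcal{FE}\llbracket FE_j \rrbracket\iota]$.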

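(1) (i) $\mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket c \rrbracket U \rrbracket\iota = \mathcal{T}\llbracket c \rrbracket\iota = \mathit{const}(c) = \mathcal{T}\llbracket c \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)$

(ii)
\begin{align*}
\text{lhs} &= \mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket I \rrbracket U \rrbracket\iota
  = \mathcal{T}\llbracket \mathit{lookupId}_{WLP}\ U\ I \rrbracket\iota
  = \mathcal{T}\llbracket (U{\uparrow}1)I \rrbracket\iota \\
\text{rhs} &= \mathcal{T}\llbracket I \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
  = \mathcal{T}\llbracket I \rrbracket\bigl((\iota{\uparrow}1)[I_i \mapsto \mathcal{T}\llbracket T_i \rrbracket\iota],\ f\bigr) \\
  &= \mathit{lookupId}\ \bigl((\iota{\uparrow}1)[I_i \mapsto \mathcal{T}\llbracket T_i \rrbracket\iota],\ f\bigr)\ I \\
  &= \mathcal{T}\llbracket (U{\uparrow}1)I \rrbracket\iota
\end{align*}

(iii)
\begin{align*}
\mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T_1\ op2_L\ T_2 \rrbracket U \rrbracket\iota
  &= \mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T_1 \rrbracket U\ \ op2_L\ \ \mathcal{T}_{WLP}\llbracket T_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T_1 \rrbracket U \rrbracket\iota\ \ \mathit{binop}_L(op2_L)\ \ \mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{T}\llbracket T_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)\ \ \mathit{binop}_L(op2_L)\ \ \mathcal{T}\llbracket T_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) && \text{// by ind. via (1)} \\
  &= \mathcal{T}\llbracket T_1\ op2_L\ T_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}

(iv)
\begin{align*}
\mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket \mathit{ite}(\varphi, T_1, T_2) \rrbracket U \rrbracket\iota
  &= \mathcal{T}\llbracket \mathit{ite}(\mathcal{F}_{WLP}\llbracket \varphi \rrbracket U,\ \mathcal{T}_{WLP}\llbracket T_1 \rrbracket U,\ \mathcal{T}_{WLP}\llbracket T_2 \rrbracket U) \rrbracket\iota \\
  &= \mathit{cond}_L(\mathcal{F}\llbracket \mathcal{F}_{WLP}\llbracket \varphi \rrbracket U \rrbracket\iota,\ \mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T_1 \rrbracket U \rrbracket\iota,\ \mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T_2 \rrbracket U \rrbracket\iota) \\
  &= \mathcal{F}\llbracket \mathcal{F}_{WLP}\llbracket \varphi \rrbracket U \rrbracket\iota\ ?\ \mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T_1 \rrbracket U \rrbracket\iota : \mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{F}\llbracket \varphi \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)\ ?\ \mathcal{T}\llbracket T_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) : \mathcal{T}\llbracket T_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) && \text{// by ind. via (1) and (2)} \\
  &= \mathcal{T}\llbracket \mathit{ite}(\varphi, T_1, T_2) \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}

(v)
\begin{align*}
\mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket FE(T) \rrbracket U \rrbracket\iota
  &= \mathcal{T}\llbracket (\mathcal{FE}_{WLP}\llbracket FE \rrbracket U)(\mathcal{T}_{WLP}\llbracket T \rrbracket U) \rrbracket\iota \\
  &= (\mathcal{FE}\llbracket \mathcal{FE}_{WLP}\llbracket FE \rrbracket U \rrbracket\iota)\ (\mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T \rrbracket U \rrbracket\iota) \\
  &= (\mathcal{FE}\llbracket FE \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota))\ (\mathcal{T}\llbracket T \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)) && \text{// by ind. via (1) and (3)} \\
  &= \mathcal{T}\llbracket FE(T) \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}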

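(2) (i) $\mathcal{F}\llbracket \mathcal{F}_{WLP}\llbracket \mathrm{T} \rrbracket U \rrbracket\iota = \mathcal{F}\llbracket \mathrm{T} \rrbracket\iota = \mathrm{T} = \mathcal{F}\llbracket \mathrm{T} \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)$

(ii) $\mathcal{F}\llbracket \mathcal{F}_{WLP}\llbracket \mathrm{F} \rrbracket U \rrbracket\iota = \mathcal{F}\llbracket \mathrm{F} \rrbracket\iota = \mathrm{F} = \mathcal{F}\llbracket \mathrm{F} \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)$

(iii)
\begin{align*}
\mathcal{F}\llbracket \mathcal{F}_{WLP}\llbracket T_1\ rop_L\ T_2 \rrbracket U \rrbracket\iota
  &= \mathcal{F}\llbracket \mathcal{T}_{WLP}\llbracket T_1 \rrbracket U\ \ rop_L\ \ \mathcal{T}_{WLP}\llbracket T_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T_1 \rrbracket U \rrbracket\iota\ \ \mathit{relop}_L(rop_L)\ \ \mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{T}\llbracket T_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)\ \ \mathit{relop}_L(rop_L)\ \ \mathcal{T}\llbracket T_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) && \text{// by ind. via (1)} \\
  &= \mathcal{F}\llbracket T_1\ rop_L\ T_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}

(iv)
\begin{align*}
\mathcal{F}\llbracket \mathcal{F}_{WLP}\llbracket \neg\varphi_1 \rrbracket U \rrbracket\iota
  &= \mathcal{F}\llbracket \neg\,\mathcal{F}_{WLP}\llbracket \varphi_1 \rrbracket U \rrbracket\iota \\
  &= \neg\,\mathcal{F}\llbracket \mathcal{F}_{WLP}\llbracket \varphi_1 \rrbracket U \rrbracket\iota \\
  &= \neg\,\mathcal{F}\llbracket \varphi_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) && \text{// by ind. via (2)} \\
  &= \mathcal{F}\llbracket \neg\varphi_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}

(v)
\begin{align*}
\mathcal{F}\llbracket \mathcal{F}_{WLP}\llbracket \varphi_1\ bop_L\ \varphi_2 \rrbracket U \rrbracket\iota
  &= \mathcal{F}\llbracket \mathcal{F}_{WLP}\llbracket \varphi_1 \rrbracket U\ \ bop_L\ \ \mathcal{F}_{WLP}\llbracket \varphi_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{F}\llbracket \mathcal{F}_{WLP}\llbracket \varphi_1 \rrbracket U \rrbracket\iota\ \ \mathit{boolop}_L(bop_L)\ \ \mathcal{F}\llbracket \mathcal{F}_{WLP}\llbracket \varphi_2 \rrbracket U \rrbracket\iota \\
  &= \mathcal{F}\llbracket \varphi_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)\ \ \mathit{boolop}_L(bop_L)\ \ \mathcal{F}\llbracket \varphi_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) && \text{// by ind. via (2)} \\
  &= \mathcal{F}\llbracket \varphi_1\ bop_L\ \varphi_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}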

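(3) (i)
\begin{align*}
\text{lhs} &= \mathcal{FE}\llbracket \mathcal{FE}_{WLP}\llbracket F \rrbracket U \rrbracket\iota \\
  &= \mathcal{FE}\llbracket \mathit{lookupFuncId}_{WLP}\ U\ F \rrbracket\iota \\
  &= \mathcal{FE}\llbracket (U{\uparrow}2)F \rrbracket\iota \\
\text{rhs} &= \mathcal{FE}\llbracket F \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) \\
  &= \mathcal{FE}\llbracket F \rrbracket\bigl((\iota{\uparrow}1)[I_i \mapsto \mathcal{T}\llbracket T_i \rrbracket\iota],\ f\bigr) \\
  &= \mathit{lookupFuncId}\ \bigl((\iota{\uparrow}1)[I_i \mapsto \mathcal{T}\llbracket T_i \rrbracket\iota],\ f\bigr)\ F \\
  &= \mathcal{FE}\llbracket (U{\uparrow}2)F \rrbracket\iota
\end{align*}

(ii)
\begin{align*}
\mathcal{FE}\llbracket \mathcal{FE}_{WLP}\llbracket FE_0[T_1 \mapsto T_2] \rrbracket U \rrbracket\iota
  &= \mathcal{FE}\llbracket (\mathcal{FE}_{WLP}\llbracket FE_0 \rrbracket U)[\mathcal{T}_{WLP}\llbracket T_1 \rrbracket U \mapsto \mathcal{T}_{WLP}\llbracket T_2 \rrbracket U] \rrbracket\iota \\
  &= (\mathcal{FE}\llbracket \mathcal{FE}_{WLP}\llbracket FE_0 \rrbracket U \rrbracket\iota)[\mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T_1 \rrbracket U \rrbracket\iota \mapsto \mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T_2 \rrbracket U \rrbracket\iota] \\
  &= (\mathcal{FE}\llbracket FE_0 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota))[\mathcal{T}\llbracket T_1 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota) \mapsto \mathcal{T}\llbracket T_2 \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)] && \text{// by ind. via (1) and (3)} \\
  &= \mathcal{FE}\llbracket FE_0[T_1 \mapsto T_2] \rrbracket(\mathcal{U}\llbracket U \rrbracket\iota)
\end{align*}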

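Theorem 4.9 For any Stmt $s$ and Formula $\varphi$, $\mathcal{F}_{WLP}\llbracket \varphi \rrbracket(\mathcal{I}_{SE}\llbracket s \rrbracket U_{id})$ is an acceptable WLP formula for $\varphi$ with respect to $s$. □

Proof: For all $\iota \in \mathit{LogicalStruct}$,
\begin{align*}
\mathcal{F}\llbracket \mathcal{F}_{WLP}\llbracket \varphi \rrbracket(\mathcal{I}_{SE}\llbracket s \rrbracket U_{id}) \rrbracket\iota
  &= \mathcal{F}\llbracket \varphi \rrbracket(\mathcal{U}\llbracket \mathcal{I}_{SE}\llbracket s \rrbracket U_{id} \rrbracket\iota) && \text{// by Lem. B.2} \\
  &= \mathcal{F}\llbracket \varphi \rrbracket(\mathcal{I}\llbracket s \rrbracket(\mathcal{U}\llbracket U_{id} \rrbracket\iota)) && \text{// by Thm. 4.4} \\
  &= \mathcal{F}\llbracket \varphi \rrbracket(\mathcal{I}\llbracket s \rrbracket\iota)
\end{align*}
and therefore, by Defn. 4.8, $\mathcal{F}_{WLP}\llbracket \varphi \rrbracket(\mathcal{I}_{SE}\llbracket s \rrbracket U_{id})$ is an acceptable WLP formula for $\varphi$ with respect to $s$. □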
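
The substitution view of the pre-image primitive can be illustrated with a small executable sketch. It assumes a toy fragment of the logic (constants, identifiers, `+`, function application and update, `<`, and negation) encoded as Python tuples; the names `T`, `FE`, `F`, `U_meaning`, `T_WLP`, `FE_WLP`, `F_WLP`, `U_s`, `phi`, `pre`, and `iotas` are hypothetical. The structure update `U_s` is written by hand and is assumed to be what symbolic evaluation would return for the statement `x = x + 1;` started from the identity update.

```python
# Terms:                ("const", c) | ("id", I) | ("add", t1, t2) | ("app", fe, t)
# Function expressions: ("F",) | ("upd", fe, t1, t2)
# Formulas:             ("lt", t1, t2) | ("not", phi)

def T(t, iota):
    """Meaning of a term in the logical structure iota = (ids, f_rho)."""
    if t[0] == "const":
        return t[1]
    if t[0] == "id":
        return iota[0][t[1]]
    if t[0] == "add":
        return T(t[1], iota) + T(t[2], iota)
    return FE(t[1], iota).get(T(t[2], iota), 0)          # ("app", fe, t)

def FE(fe, iota):
    """Meaning of a function expression, as a fresh dict."""
    if fe[0] == "F":
        return dict(iota[1])
    f = FE(fe[1], iota)                                  # ("upd", fe, t1, t2)
    f[T(fe[2], iota)] = T(fe[3], iota)
    return f

def F(phi, iota):
    """Meaning of a formula."""
    if phi[0] == "lt":
        return T(phi[1], iota) < T(phi[2], iota)
    return not F(phi[1], iota)                           # ("not", phi)

def U_meaning(U, iota):
    ids, fe = U
    return ({i: T(t, iota) for i, t in ids.items()}, FE(fe, iota))

# WLP reinterpretation: substitute U's terms for identifiers and function symbols.
def T_WLP(t, U):
    if t[0] == "const":
        return t
    if t[0] == "id":
        return U[0][t[1]]
    if t[0] == "add":
        return ("add", T_WLP(t[1], U), T_WLP(t[2], U))
    return ("app", FE_WLP(t[1], U), T_WLP(t[2], U))

def FE_WLP(fe, U):
    if fe[0] == "F":
        return U[1]
    return ("upd", FE_WLP(fe[1], U), T_WLP(fe[2], U), T_WLP(fe[3], U))

def F_WLP(phi, U):
    if phi[0] == "lt":
        return ("lt", T_WLP(phi[1], U), T_WLP(phi[2], U))
    return ("not", F_WLP(phi[1], U))

x_val = ("app", ("F",), ("id", "x"))                     # "the value stored at x"
# Assumed result of symbolically evaluating "x = x + 1;" from the identity update.
U_s = ({"x": ("id", "x")},
       ("upd", ("F",), ("id", "x"), ("add", x_val, ("const", 1))))
phi = ("lt", x_val, ("const", 10))                       # postcondition: x < 10
pre = F_WLP(phi, U_s)                                    # candidate WLP formula

# Pre-states in which x (at address 100) holds 8, 9, and 10.
iotas = [({"x": 100}, {100: v}) for v in (8, 9, 10)]
results = [F(pre, i) for i in iotas]

# Lemma B.2(2) on these instances: evaluating the substituted formula in the
# pre-state agrees with evaluating phi in the post-state.
assert all(F(pre, i) == F(phi, U_meaning(U_s, i)) for i in iotas)
```

Here `results` is `[True, False, False]`: the postcondition `x < 10` holds after the increment only when `x` is at most 8 beforehand, which is what the substituted formula `pre` reports.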


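B.3 Correctness of the Symbolic-Composition Primitive

We now show that the meaning of $\mathcal{U}_{SC}\llbracket U_2 \rrbracket U_1$ is the composition of the meanings of $U_2$ and $U_1$.

Theorem 4.11 For all $U_1, U_2 \in \mathit{StructUpdate}$,

\[
\mathcal{U}\llbracket \mathcal{U}_{SC}\llbracket U_2 \rrbracket U_1 \rrbracket = \mathcal{U}\llbracket U_2 \rrbracket \circ \mathcal{U}\llbracket U_1 \rrbracket.
\]

□

Proof: Let $U_2 = (\{I_i \mapsfrom T_i\},\ \{F_j \mapsfrom FE_j\})$; let $I_k$ and $F_m$ range over Id and FuncId, respectively; and let $\iota \in \mathit{LogicalStruct}$ be an arbitrary logical structure.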


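\begin{align*}
\mathcal{U}\llbracket \mathcal{U}_{SC}\llbracket U_2 \rrbracket U_1 \rrbracket\iota
  &= \mathcal{U}\llbracket \bigl((U_1{\uparrow}1)[I_i \mapsfrom \mathcal{T}_{WLP}\llbracket T_i \rrbracket U_1],\ (U_1{\uparrow}2)[F_j \mapsfrom \mathcal{FE}_{WLP}\llbracket FE_j \rrbracket U_1]\bigr) \rrbracket\iota \\
  &= \mathcal{U}\llbracket \bigl(\{I_k \mapsfrom ((U_1{\uparrow}1)[I_i \mapsfrom \mathcal{T}_{WLP}\llbracket T_i \rrbracket U_1])I_k\},\ \{F_m \mapsfrom ((U_1{\uparrow}2)[F_j \mapsfrom \mathcal{FE}_{WLP}\llbracket FE_j \rrbracket U_1])F_m\}\bigr) \rrbracket\iota \\
  &= \bigl((\iota{\uparrow}1)[I_k \mapsto \mathcal{T}\llbracket ((U_1{\uparrow}1)[I_i \mapsfrom \mathcal{T}_{WLP}\llbracket T_i \rrbracket U_1])I_k \rrbracket\iota], \\
  &\qquad (\iota{\uparrow}2)[F_m \mapsto \mathcal{FE}\llbracket ((U_1{\uparrow}2)[F_j \mapsfrom \mathcal{FE}_{WLP}\llbracket FE_j \rrbracket U_1])F_m \rrbracket\iota]\bigr) \\
  &= \bigl((\iota{\uparrow}1)[I_k\,(\neq I_i) \mapsto \mathcal{T}\llbracket (U_1{\uparrow}1)I_k \rrbracket\iota][I_i \mapsto \mathcal{T}\llbracket \mathcal{T}_{WLP}\llbracket T_i \rrbracket U_1 \rrbracket\iota], \\
  &\qquad (\iota{\uparrow}2)[F_m\,(\neq F_j) \mapsto \mathcal{FE}\llbracket (U_1{\uparrow}2)F_m \rrbracket\iota][F_j \mapsto \mathcal{FE}\llbracket \mathcal{FE}_{WLP}\llbracket FE_j \rrbracket U_1 \rrbracket\iota]\bigr) \\
  &= \bigl((\iota{\uparrow}1)[I_k\,(\neq I_i) \mapsto \mathcal{T}\llbracket (U_1{\uparrow}1)I_k \rrbracket\iota][I_i \mapsto \mathcal{T}\llbracket T_i \rrbracket(\mathcal{U}\llbracket U_1 \rrbracket\iota)], \\
  &\qquad (\iota{\uparrow}2)[F_m\,(\neq F_j) \mapsto \mathcal{FE}\llbracket (U_1{\uparrow}2)F_m \rrbracket\iota][F_j \mapsto \mathcal{FE}\llbracket FE_j \rrbracket(\mathcal{U}\llbracket U_1 \rrbracket\iota)]\bigr) && \text{// by Lem. B.2} \\
  &= \bigl(((\mathcal{U}\llbracket U_1 \rrbracket\iota){\uparrow}1)[I_i \mapsto \mathcal{T}\llbracket T_i \rrbracket(\mathcal{U}\llbracket U_1 \rrbracket\iota)], \\
  &\qquad ((\mathcal{U}\llbracket U_1 \rrbracket\iota){\uparrow}2)[F_j \mapsto \mathcal{FE}\llbracket FE_j \rrbracket(\mathcal{U}\llbracket U_1 \rrbracket\iota)]\bigr) \\
  &= \mathcal{U}\llbracket U_2 \rrbracket(\mathcal{U}\llbracket U_1 \rrbracket\iota)
\end{align*}

□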
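
Theorem 4.11 can likewise be exercised on a small instance. The sketch below assumes the same toy tuple encoding of terms and function expressions as a stand-in for the logic, and it assumes that symbolic composition substitutes the terms of the first update into the second, as the proof above does. The names `T`, `FE`, `U_meaning`, `subst_term`, `subst_fe`, `U_SC`, `U1`, `U2`, and `iota` are hypothetical.

```python
# Terms:                ("const", c) | ("id", I) | ("add", t1, t2) | ("app", fe, t)
# Function expressions: ("F",) | ("upd", fe, t1, t2)

def T(t, iota):
    """Meaning of a term in the logical structure iota = (ids, f_rho)."""
    if t[0] == "const":
        return t[1]
    if t[0] == "id":
        return iota[0][t[1]]
    if t[0] == "add":
        return T(t[1], iota) + T(t[2], iota)
    return FE(t[1], iota).get(T(t[2], iota), 0)          # ("app", fe, t)

def FE(fe, iota):
    """Meaning of a function expression, as a fresh dict."""
    if fe[0] == "F":
        return dict(iota[1])
    f = FE(fe[1], iota)                                  # ("upd", fe, t1, t2)
    f[T(fe[2], iota)] = T(fe[3], iota)
    return f

def U_meaning(U, iota):
    ids, fe = U
    return ({i: T(t, iota) for i, t in ids.items()}, FE(fe, iota))

def subst_term(t, U):
    """Replace identifiers and the function symbol in t by U's terms."""
    if t[0] == "const":
        return t
    if t[0] == "id":
        return U[0][t[1]]
    if t[0] == "add":
        return ("add", subst_term(t[1], U), subst_term(t[2], U))
    return ("app", subst_fe(t[1], U), subst_term(t[2], U))

def subst_fe(fe, U):
    if fe[0] == "F":
        return U[1]
    return ("upd", subst_fe(fe[1], U), subst_term(fe[2], U), subst_term(fe[3], U))

def U_SC(U2, U1):
    """Symbolic composition: one update with the effect of U1 followed by U2."""
    ids2, fe2 = U2
    ids = {**U1[0], **{i: subst_term(t, U1) for i, t in ids2.items()}}
    return (ids, subst_fe(fe2, U1))

ids_id = {"x": ("id", "x"), "y": ("id", "y")}
x_val = ("app", ("F",), ("id", "x"))
U1 = (ids_id, ("upd", ("F",), ("id", "x"), ("add", x_val, ("const", 1))))  # x = x + 1;
U2 = (ids_id, ("upd", ("F",), ("id", "y"), x_val))                         # y = x;

# x lives at address 100 and holds 7; y lives at 104 and holds 0.
iota = ({"x": 100, "y": 104}, {100: 7, 104: 0})

# Theorem 4.11 on this instance.
assert U_meaning(U_SC(U2, U1), iota) == U_meaning(U2, U_meaning(U1, iota))
```

On this instance both sides produce the store `{100: 8, 104: 8}`, i.e., the effect of running `x = x + 1;` and then `y = x;`.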


